The regex trick is nice!

On Thu, Jul 19, 2012 at 9:52 AM, Jamey Wood <[email protected]> wrote:
> Thank you both, Lance and Jeyendran. I am using a post-processing approach
> along the lines of what you've suggested. I just wanted to be sure there
> wasn't some better practice that I was overlooking.
>
> Thanks,
> Jamey
>
> On Thu, Jul 19, 2012 at 10:42 AM, Jeyendran Balakrishnan
> <[email protected]> wrote:
>
>> For your particular use case of detecting URLs, another way is to
>> preprocess your sentence with a custom URL regex detector: store the
>> detected URLs in a hash map, replace each detected URL in the sentence
>> with its hash key (or even something like "URL1", "URL2", etc., which
>> should not occur naturally in your text), run the sentence through the
>> OpenNLP tokenizer, then post-process the resulting tokens to replace each
>> placeholder with the corresponding URL from the hash map. The idea is
>> that each placeholder inserted during preprocessing comes out of the
>> tokenizer as a separate token, so it can easily be replaced by the
>> corresponding URL extracted by the regex. Since the tokenizer operates
>> per sentence, the hash map stays small.
>>
>> This approach can be used for any regex-based token detector, e.g. for
>> emails, decimal numbers, etc.
>>
>> -Jeyendran
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:[email protected]]
>> Sent: Wednesday, July 18, 2012 11:33 PM
>> To: [email protected]
>> Subject: Re: Augmenting TokenizerME
>>
>> I would post-process the output, hunt for URLs, and rebuild them.
>>
>> I believe the statistical models are not fungible. More important, these
>> are statistical models and have an error rate. You can do a much better
>> job by putting the pieces back together after the tokenizer takes them
>> apart.
>>
>> On Wed, Jul 18, 2012 at 11:34 AM, Jamey Wood <[email protected]> wrote:
>> > Is there some way to augment a TokenizerME object without having to
>> > start with your own full set of training data? For example, we run
>> > into cases where a TokenizerME with the standard "en-token.bin" model
>> > performs mostly well for us, but does not do a good job with inline
>> > URLs that are common in the text we're using. (In most cases, it will
>> > split these up so that "http://whatever.com" becomes something like
>> > [ "http", ":", "/", "/", "whatever", "com" ].)
>> >
>> > Is there some way that we can continue using TokenizerME and the
>> > standard "en-token.bin" model, but augment it with our own logic to
>> > detect and tokenize URLs? Or would we need to go all the way down to
>> > the model training level and come up with our own replacement for
>> > en-token.bin?
>> >
>> > Thanks,
>> > Jamey
>>
>> --
>> Lance Norskog
>> [email protected]
--
Lance Norskog
[email protected]
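
A minimal sketch of Jeyendran's placeholder approach in Java, wrapped around
OpenNLP's TokenizerME. The URL regex, the "URL1"/"URL2" placeholder scheme,
the UrlAwareTokenizer class name, and the "en-token.bin" path are illustrative
assumptions; the sketch also assumes the placeholders never occur naturally in
the input and that the statistical tokenizer emits each one as a single token.

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class UrlAwareTokenizer {

    // Deliberately naive URL pattern; a real detector would be stricter.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    private final Tokenizer tokenizer;

    public UrlAwareTokenizer(Tokenizer tokenizer) {
        this.tokenizer = tokenizer;
    }

    public String[] tokenize(String sentence) {
        // 1. Mask each detected URL with a placeholder ("URL1", "URL2", ...),
        //    remembering the original URLs in order.
        List<String> urls = new ArrayList<String>();
        StringBuffer masked = new StringBuffer();
        Matcher m = URL.matcher(sentence);
        while (m.find()) {
            urls.add(m.group());
            m.appendReplacement(masked, "URL" + urls.size());
        }
        m.appendTail(masked);

        // 2. Tokenize the masked sentence with the statistical tokenizer.
        String[] tokens = tokenizer.tokenize(masked.toString());

        // 3. Swap each placeholder token back for the original URL.
        //    Assumes "URL<n>" does not occur naturally in the text.
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].matches("URL\\d+")) {
                int n = Integer.parseInt(tokens[i].substring(3));
                if (n >= 1 && n <= urls.size()) {
                    tokens[i] = urls.get(n - 1);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream("en-token.bin"); // path is an assumption
        try {
            Tokenizer me = new TokenizerME(new TokenizerModel(in));
            String[] tokens = new UrlAwareTokenizer(me)
                    .tokenize("See http://whatever.com for details.");
            System.out.println(java.util.Arrays.toString(tokens));
        } finally {
            in.close();
        }
    }
}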

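Lance's post-processing alternative can be sketched with tokenizePos, which
returns character offsets for each token: any run of tokens that falls inside
a regex-detected URL span is collapsed back into a single token. The URL
pattern and the UrlRejoiner class name are again assumptions, and token spans
that only partially overlap a URL are not handled.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

public final class UrlRejoiner {

    // Illustrative URL pattern; substitute whatever detector you trust.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Tokenizes the sentence, then collapses any run of tokens that falls
    // inside a regex-detected URL back into a single URL token.
    public static String[] tokenize(Tokenizer tokenizer, String sentence) {
        Span[] tokenSpans = tokenizer.tokenizePos(sentence);

        List<Span> urlSpans = new ArrayList<Span>();
        Matcher m = URL.matcher(sentence);
        while (m.find()) {
            urlSpans.add(new Span(m.start(), m.end()));
        }

        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < tokenSpans.length) {
            Span url = covering(urlSpans, tokenSpans[i]);
            if (url == null) {
                tokens.add(tokenSpans[i].getCoveredText(sentence).toString());
                i++;
            } else {
                // Emit the whole URL once and skip every token it covers.
                tokens.add(url.getCoveredText(sentence).toString());
                while (i < tokenSpans.length && tokenSpans[i].getStart() < url.getEnd()) {
                    i++;
                }
            }
        }
        return tokens.toArray(new String[0]);
    }

    private static Span covering(List<Span> urlSpans, Span token) {
        for (Span url : urlSpans) {
            if (url.contains(token)) {
                return url;
            }
        }
        return null;
    }
}

Working from character offsets avoids having to guess exactly how the
tokenizer split the URL apart, which can vary from model to model.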