Hello,

You need to use OpenNLP via its API. The tokenizer has a tokenizePos method, which returns the spans (start and end character offsets) of the detected tokens.
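
A minimal sketch of that call, assuming the OpenNLP tools jar is on the classpath. It uses SimpleTokenizer only because it needs no model file; a trained tokenizer (TokenizerME with a model) implements the same Tokenizer interface and would be called the same way, though it may split the text differently. The class name TokenOffsets and the helper tokenStarts are made up for this illustration.

```java
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

public class TokenOffsets {

    // Hypothetical helper: returns the start offset of every token in the text.
    static int[] tokenStarts(String text) {
        // SimpleTokenizer is rule based (splits on character class) and needs no model.
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

        // tokenizePos returns one Span per token; each Span holds offsets
        // into the original string (start inclusive, end exclusive).
        Span[] spans = tokenizer.tokenizePos(text);

        int[] starts = new int[spans.length];
        for (int i = 0; i < spans.length; i++) {
            starts[i] = spans[i].getStart();
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Mary didn't kiss John";
        Span[] spans = SimpleTokenizer.INSTANCE.tokenizePos(text);

        for (Span span : spans) {
            // Recover the token text from the original string via its offsets.
            String token = text.substring(span.getStart(), span.getEnd());
            System.out.println("(" + token + ", " + span.getStart() + ")");
        }
    }
}
```

Because the offsets refer to the original string, they stay valid even when the tokenizer splits a word without any whitespace in between.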
Have a look at our documentation:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.api

We do not support this in the command line interface.

Hope that helps,
Jörn

On 09/13/2012 04:26 AM, Adam Goodkind wrote:
Hi,

When tokenizing a string of text, is there also a way to track the index (of the original text) where the token begins? For example:

"Mary didn't kiss John"
[(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

If there is a way to extract the 0, 5, 8, 12 and 17 from somewhere, that would be great. I cannot rely on whitespace, since the tokenizer sometimes breaks up words.

Thanks,
Adam
