Hi,

When tokenizing a string of text, is there a way to also track the index in 
the original text where each token begins?

For example:
"Mary didn't kiss John"
[(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

If there is a way to extract the 0, 5, 8, 12 and 17 from somewhere, that would 
be great. I cannot rely on whitespace, since the tokenizer sometimes breaks up 
words.
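
A minimal sketch of one way to recover such offsets after the fact, with no
reliance on whitespace. It is not tied to any particular tokenizer: the
function name token_offsets is hypothetical, the token list is hard-coded to
match the example above, and it assumes the tokenizer returns each token's
text exactly as it appears in the original string (no quote normalization or
other rewriting).

```python
def token_offsets(text, tokens):
    """Pair each token with the index in `text` where it begins.

    Assumes every token appears verbatim in `text`, in order.
    """
    offsets = []
    cursor = 0  # the search resumes where the previous token ended
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            # The tokenizer changed the token text, so it cannot be aligned.
            raise ValueError(
                f"token {token!r} not found at or after index {cursor}"
            )
        offsets.append((token, start))
        cursor = start + len(token)
    return offsets


text = "Mary didn't kiss John"
tokens = ["Mary", "did", "n't", "kiss", "John"]  # assumed tokenizer output
result = token_offsets(text, tokens)
print(result)
# [('Mary', 0), ('did', 5), ("n't", 8), ('kiss', 12), ('John', 17)]
```

Searching forward from a moving cursor keeps repeated tokens from matching an
earlier occurrence. If the tokenizer rewrites token text, this alignment
fails and the offsets would have to come from the tokenizer itself.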

Thanks,
Adam
