Extracting Indices When Tokenizing

Adam Goodkind Wed, 12 Sep 2012 19:27:27 -0700

Hi,

When tokenizing a string of text, is there also a way to track the index (in 
the original text) where each token begins?


For example:
"Mary didn't kiss John"
[(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

If there is a way to extract the 0, 5, 8, 12 and 17 from somewhere, that would 
be great. I cannot rely on whitespace, since the tokenizer sometimes breaks up 
words.
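
A minimal sketch of the alignment described above, not tied to any 
particular tokenizer. It assumes each token appears verbatim in the original 
text, in order; the function name token_offsets is hypothetical. The idea is 
to search for each token starting from where the previous one ended, so 
whitespace is never consulted.

```python
def token_offsets(text, tokens):
    """Pair each token with the index in `text` where it begins."""
    offsets = []
    cursor = 0  # position just past the previously matched token
    for token in tokens:
        # Search only from the cursor onward so repeated words map to
        # successive occurrences rather than always the first one.
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found at or after index {cursor}")
        offsets.append((token, start))
        cursor = start + len(token)
    return offsets


text = "Mary didn't kiss John"
tokens = ["Mary", "did", "n't", "kiss", "John"]
print(token_offsets(text, tokens))
# [('Mary', 0), ('did', 5), ("n't", 8), ('kiss', 12), ('John', 17)]
```

This approach breaks down if the tokenizer rewrites characters instead of 
only splitting the text, because the altered token can no longer be found in 
the original string.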

Thanks,
Adam
