I've been working with OpenNLP sporadically over the years, and I am now upgrading to the current version. In doing so, I stumbled across some very odd (and undocumented) behavior.
Specifically, the Spans generated by NameFinderME.find() have start and end indexes that refer to token positions, not character offsets. OK - I can handle this. However, Span.getCoveredText(String text) supposedly returns the text covered by the span, i.e., the actual entity found. This method takes the start and end indexes - which are token indexes, not character indexes - and uses them to perform a substring operation. This produces incorrect results.

For instance, using the standard English models on the following (tokenized) sentence:

    10 people were killed in Orchard Road on 12 May 2014.

the name finders generate spans for location (start=5, end=7) and date (start=8, end=11). When you call getCoveredText on each of these spans, you would expect to get "Orchard Road" and "12 May 2014". Instead, because the token index is used as a character index, you get characters 5-7 and 8-11 of the sentence: "op" and "e w".

This seems to be an inconsistency, and it should either be fixed or at least documented.

Edward Swing
Applied Research Technologist
Vision Systems + Technology, Inc., a SAS Company
6021 University Boulevard * Suite 360 * Ellicott City * Maryland * 21043
Tel: 410.418.5555 Ext: 919 * Fax: 410.418.8580
Web: http://www.vsticorp.com
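
P.S. For anyone who runs into the same thing, here is a minimal sketch of the difference in plain Java, with no OpenNLP dependency. The class and method names (TokenSpanDemo, coveredTokens, coveredChars) are hypothetical and exist only for illustration. coveredChars only mimics the substring behavior described above and is not the library's actual code. I believe OpenNLP's Span.spansToStrings(Span[], String[]) does the token-based join for you, but please check that against the Javadoc for your version.

```java
import java.util.Arrays;

class TokenSpanDemo {

    // Hypothetical helper: interprets [start, end) as token indexes
    // and joins the covered tokens with single spaces.
    static String coveredTokens(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    // Mimics the behavior described in this message: the same indexes
    // are applied to the sentence as character offsets.
    static String coveredChars(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String[] tokens = {"10", "people", "were", "killed", "in",
                           "Orchard", "Road", "on", "12", "May", "2014", "."};
        String text = String.join(" ", tokens);

        // Token-based interpretation of the two spans
        System.out.println(coveredTokens(tokens, 5, 7));   // Orchard Road
        System.out.println(coveredTokens(tokens, 8, 11));  // 12 May 2014

        // Character-based interpretation of the same indexes
        System.out.println(coveredChars(text, 5, 7));      // op
        System.out.println(coveredChars(text, 8, 11));     // e w
    }
}
```

Joining with single spaces rebuilds the entity from the tokens, so the result can differ from the original text wherever the tokenizer split without whitespace.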
