Expose characters offsets information while parsing text-based inputs.
----------------------------------------------------------------------

                 Key: TIKA-272
                 URL: https://issues.apache.org/jira/browse/TIKA-272
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.4
            Reporter: David Causse
            Priority: Minor

It would be useful to have access to the actual character offsets when parsing text-based files (I don't know whether this is interesting, usable, or doable for binary formats).

If I use Tika to parse HTML and feed the parsed strings into Lucene, I cannot tell the Lucene analyzer where each character sits in the original input.

If Tika exposed this information, unmodified Lucene analyzers could be used behind Tika, which would make it possible to implement, for example, pretty highlighting in search results (see the Google cache view).

With the new Lucene Attribute API, it could be fairly easy to provide a sort of TikaOffsetRectifierTokenFilter in Lucene contrib and use a stack like tika -> unmodified lucene analyzer -> tika offset correction.
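
To make the "offset correction" step of that stack more concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: it uses neither Tika nor Lucene, the names OffsetCorrectionSketch, Extracted, stripTags and toOriginalSpan are hypothetical, and the toy tag stripper merely stands in for a real parser that would have to report the offsets itself.

```java
import java.util.Arrays;

/** Illustrative only: none of these names exist in Tika or Lucene. */
final class OffsetCorrectionSketch {

    /** Extracted text plus, for each extracted character, its index in the original input. */
    static final class Extracted {
        final String text;
        final int[] originalIndex;

        Extracted(String text, int[] originalIndex) {
            this.text = text;
            this.originalIndex = originalIndex;
        }

        /** Maps a [start, end) span of the extracted text back to the original input. */
        int[] toOriginalSpan(int start, int end) {
            if (start < 0 || end > text.length() || start >= end) {
                throw new IllegalArgumentException("invalid span: " + start + ".." + end);
            }
            // End is exclusive, so map the last character and step one past it.
            return new int[] { originalIndex[start], originalIndex[end - 1] + 1 };
        }
    }

    /** Toy stand-in for a parser: drops anything between '<' and '>' and records offsets. */
    static Extracted stripTags(String input) {
        StringBuilder text = new StringBuilder();
        int[] offsets = new int[input.length()];
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                // Remember where this emitted character came from.
                offsets[text.length()] = i;
                text.append(c);
            }
        }
        return new Extracted(text.toString(), Arrays.copyOf(offsets, text.length()));
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        Extracted extracted = stripTags(html);

        // An analyzer working on the extracted text only sees offsets into "Hello world".
        int start = extracted.text.indexOf("world");
        int[] span = extracted.toOriginalSpan(start, start + "world".length());

        System.out.println(extracted.text);
        System.out.println(html.substring(span[0], span[1]));
    }
}
```

In this sketch the analyzer stays unmodified and a later step rewrites each token's start and end offsets through the recorded map. A span that crosses a tag boundary would map to a range of the original input that includes the markup, and character entities are not handled at all; both would have to be addressed in a real implementation.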