On Fri, 27 Aug 2021, Peter Kronenberg wrote:
When Tika extracts from a Microsoft Word document, deleted text is extracted, with no indication that it is deleted. In fact, if a word was deleted and replaced by another word, both words just show up side-by-side. Is there a way to get some sort of annotation that indicates the status of the text? Or extract it in some sort of structured (e.g., XML) format?

How are you calling Tika? Is the XHTML output sufficiently marked up to let you spot it?

Nick
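
One way to answer the XHTML question is to list the elements that actually appear in the output and see whether any of them wrap the deleted text. The following is only a minimal sketch using the Python standard library. It assumes the XHTML has already been captured as a string. The sample document, the helper names, and the del/ins elements in it are hypothetical stand-ins, not confirmed Tika behaviour.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample standing in for XHTML captured from Tika.
# The <del>/<ins> elements are an assumption for illustration only;
# whether Tika emits anything like them is the open question.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body><p>The <del>quick</del><ins>slow</ins> brown fox</p></body>
</html>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_inventory(xhtml):
    """Count how often each element name occurs in the document."""
    root = ET.fromstring(xhtml)
    return Counter(local_name(el.tag) for el in root.iter())


def text_by_element(xhtml, names):
    """Return (element name, text) pairs, in document order, for the given names."""
    root = ET.fromstring(xhtml)
    return [
        (local_name(el.tag), "".join(el.itertext()))
        for el in root.iter()
        if local_name(el.tag) in names
    ]


if __name__ == "__main__":
    print(element_inventory(SAMPLE_XHTML))
    print(text_by_element(SAMPLE_XHTML, {"del", "ins"}))
```

If the inventory for a real document shows nothing beyond ordinary paragraph and formatting elements around the deleted words, the XHTML alone is probably not enough to tell deleted text from kept text.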
