You can turn off the extraction of deleted text via OfficeParserConfig#setIncludeDeletedContent.
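A minimal sketch of how that setting might be wired in, assuming a recent Tika release with the Microsoft parser module on the classpath. The class name SkipDeletedText and the command-line argument are placeholders, and which Office formats actually honor the flag depends on the Tika version, so check the Javadoc for the release you are on:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not part of Tika.
public class SkipDeletedText {
    public static void main(String[] args) throws Exception {
        // Path to the Word document is taken from the command line (placeholder).
        Path docPath = Paths.get(args[0]);

        // Ask the Office parsers not to emit text that was deleted under track changes.
        OfficeParserConfig officeConfig = new OfficeParserConfig();
        officeConfig.setIncludeDeletedContent(false);

        // The config reaches the parser through the ParseContext.
        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, officeConfig);

        // -1 disables the default write limit on the extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(docPath)) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }

        System.out.println(handler.toString());
    }
}
```

Passing true instead should bring the deleted text back, but as plain text with no marker around it, which is the behavior the original question describes.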
However, I agree that it would be an improvement to add div tags for deleted text. I haven’t been in this part of the codebase in a while. It _might_ be fairly trivial to add.

On Fri, Aug 27, 2021 at 10:34 AM Peter Kronenberg <[email protected]> wrote:

> When Tika extracts from a Microsoft Word document, deleted text is
> extracted, with no indication that it is deleted. In fact, if a word was
> deleted and replaced by another word, both words just show up
> side-by-side. Is there a way to get some sort of annotation that indicates
> the status of the text? Or extract it in some sort of structured (e.g.,
> XML) format? Similarly for highlighted text or other mark-up. Any way to
> get that?
>
> For example
>
> *Time of Essence* was changed *Time of Importance*
>
> *Peter Kronenberg* | *Senior AI Analytic Engineer*
