You can turn off the extraction of deleted text via
OfficeParserConfig#setIncludeDeletedContent.
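
For example, here is a minimal sketch of passing that setting through a
ParseContext. It assumes tika-core and the Microsoft parser module are on
the classpath; the class name SkipDeletedText and the use of args[0] as the
input path are just placeholders for illustration. Which extractors honor
the flag may depend on your Tika version, so please check the results
against your own documents.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name, used only for this illustration.
public class SkipDeletedText {
    public static void main(String[] args) throws Exception {
        // Ask the Office parsers not to emit tracked-change deletions.
        OfficeParserConfig officeConfig = new OfficeParserConfig();
        officeConfig.setIncludeDeletedContent(false);

        // The config reaches the parser through the ParseContext.
        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, officeConfig);

        // -1 disables the default write limit on the extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);

        // args[0] is assumed to be the path to a Word document.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```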

However, I agree that it would be an improvement to add div tags for
deleted text.  I haven’t been in this part of the codebase in a while. It
_might_ be fairly trivial to add.


On Fri, Aug 27, 2021 at 10:34 AM Peter Kronenberg <[email protected]>
wrote:

> When Tika extracts from a Microsoft Word document, deleted text is
> extracted, with no indication that it is deleted.  In fact, if a word was
> deleted and replaced by another word, both words just show up
> side-by-side.  Is there a way to get some sort of annotation that indicates
> the status of the text?  Or extract it in some sort of structured (e.g.,
> XML) format?  Similarly for highlighted text or other mark-up.  Any way to
> get that?
>
> For example:
>
> *Time of Essence* was changed to *Time of Importance*
