> It _might_ be fairly trivial to add.

It isn’t, at least for docx. The challenge, IIRC, is that the deleted-content
flag can be on a run, a paragraph, a table, etc., so you’d have to add deleted
checks on every element.

So, definitely possible, but non-trivial.

In addition to deleted text, there’s also “moveFrom” text, which we should
handle at the same time if we fix this for deleted text.
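
To make the two cases concrete, here is a rough, JDK-only sketch of tagging text by its tracked-change status in raw WordprocessingML. It is not Tika's code: the class name, method names, and sample XML are all made up for illustration. It assumes the run-level wrappers w:del and w:moveFrom are the only markers, so it ignores the flags that can sit on paragraph marks or table rows, which is exactly the part that makes a real fix non-trivial.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; none of these names exist in Tika.
final class TrackedChangeSketch {
    // Main WordprocessingML namespace used by docx document parts.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Made-up paragraph: plain text, a deletion, an insertion, and a moved-away run.
    static final String SAMPLE =
            "<w:p xmlns:w='" + W_NS + "'>"
            + "<w:r><w:t xml:space='preserve'>Time of </w:t></w:r>"
            + "<w:del w:id='1' w:author='a'><w:r><w:delText>Essence</w:delText></w:r></w:del>"
            + "<w:ins w:id='2' w:author='a'><w:r><w:t>Importance</w:t></w:r></w:ins>"
            + "<w:moveFrom w:id='3' w:author='a'><w:r><w:t>moved away</w:t></w:r></w:moveFrom>"
            + "</w:p>";

    // Walks up the ancestors and reports the nearest w:del or w:moveFrom wrapper.
    static String statusOf(Node node) {
        for (Node n = node.getParentNode(); n != null; n = n.getParentNode()) {
            if (W_NS.equals(n.getNamespaceURI())) {
                if ("del".equals(n.getLocalName())) {
                    return "deleted";
                }
                if ("moveFrom".equals(n.getLocalName())) {
                    return "moveFrom";
                }
            }
        }
        return "normal";
    }

    // Returns one "status:text" entry per w:t or w:delText element, in document order.
    static List<String> annotate(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList all = doc.getElementsByTagNameNS(W_NS, "*");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node el = all.item(i);
                String name = el.getLocalName();
                if ("t".equals(name) || "delText".equals(name)) {
                    out.add(statusOf(el) + ":" + el.getTextContent());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String entry : annotate(SAMPLE)) {
            System.out.println(entry);
        }
    }
}
```

An extractor that wanted to emit div tags could use a status like this to decide which wrapper to open around each run; inserted text is left as "normal" here only to keep the sketch short.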

Finally, it looks like includeDeletedContent may not work correctly for
docx. :( I’d need to check when I’m back at a keyboard.

If this is important to you, please open an issue with example documents.

On Sat, Aug 28, 2021 at 9:13 AM Tim Allison <[email protected]> wrote:

>
> You can turn off the extraction of deleted text via the
> OfficeParserConfig#setIncludeDeletedContent
>
> However, I agree that it would be an improvement to add div tags for
> deleted text.  I haven’t been in this part of the codebase in a while. It
> _might_ be fairly trivial to add.
>
>
> On Fri, Aug 27, 2021 at 10:34 AM Peter Kronenberg <
> [email protected]> wrote:
>
>> When Tika extracts from a Microsoft Word document, deleted text is
>> extracted, with no indication that it is deleted.  In fact, if a word was
>> deleted and replaced by another word, both words just show up
>> side-by-side.  Is there a way to get some sort of annotation that indicates
>> the status of the text?  Or extract it in some sort of structured (e.g.,
>> XML) format?  Similarly for highlighted text or other mark-up.  Any way to
>> get that?
>>
>>
>>
>> For example
>>
>>
>>
>> *Time of Essence* was changed to *Time of Importance*
>>
>>
>>
>> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
