Thanks. I'll at least try the flag and see if that improves things.

________________________________
From: Tim Allison <[email protected]>
Sent: Saturday, August 28, 2021 9:26:01 AM
To: [email protected] <[email protected]>
Subject: Re: Deleted text in Word document

> It _might_ be fairly trivial to add.

It isn’t, at least for docx. The challenge, iirc, is that the deleted content 
flag can be on a run, a paragraph, a table, etc. You’d have to add deleted 
checks on every element.

So, definitely possible, but non-trivial.

In addition to deleted text, there’s also “movefrom” text which we should 
handle at the same time if we fix this for deleted text.
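
To make the "checks on every element" point concrete, here is a rough sketch 
that works on raw WordprocessingML rather than through Tika. It is not how 
Tika's docx parser is written; the class name, method names, and the 
[del]/[moveFrom] markers are made up for illustration. It only covers the 
easy case, where a w:del or w:moveFrom element wraps one or more runs. As far 
as I recall, deleted paragraph marks and deleted table rows are flagged by an 
empty w:del inside the element's properties instead, and this sketch ignores 
those, which is exactly why the real fix is non-trivial.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Tika: marks run-level tracked deletions.
public class TrackedChangeSketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the fragment's text, wrapping deleted and moved-from text in markers. */
    public static String extract(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Node root = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        StringBuilder out = new StringBuilder();
        walk(root, null, out);
        return out.toString();
    }

    // marker is "del" or "moveFrom" while we are below such an element, else null.
    private static void walk(Node node, String marker, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(node.getNamespaceURI())) {
            String name = node.getLocalName();
            if (name.equals("del") || name.equals("moveFrom")) {
                marker = name;
            } else if (name.equals("t") || name.equals("delText")) {
                String text = node.getTextContent();
                out.append(marker == null
                        ? text
                        : "[" + marker + "]" + text + "[/" + marker + "]");
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, marker, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hand-written fragment: "Essence" deleted, "Importance" inserted.
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r><w:t>Time of </w:t></w:r>"
                + "<w:del w:id=\"1\"><w:r><w:delText>Essence</w:delText></w:r></w:del>"
                + "<w:ins w:id=\"2\"><w:r><w:t>Importance</w:t></w:r></w:ins>"
                + "</w:p>";
        System.out.println(extract(xml));
    }
}
```

Run on that hand-written paragraph, main prints 
Time of [del]Essence[/del]Importance, so the deleted word is at least 
distinguishable from its replacement.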

Finally, it looks like includeDeletedContent may not work correctly for docx. 
:( I’d need to check when I’m back at a keyboard.

If this is important for you, please open an issue with example documents.

On Sat, Aug 28, 2021 at 9:13 AM Tim Allison <[email protected]> wrote:

You can turn off the extraction of deleted text via 
OfficeParserConfig#setIncludeDeletedContent.

However, I agree that it would be an improvement to add div tags for deleted 
text.  I haven’t been in this part of the codebase in a while. It _might_ be 
fairly trivial to add.


On Fri, Aug 27, 2021 at 10:34 AM Peter Kronenberg <[email protected]> wrote:

When Tika extracts from a Microsoft Word document, deleted text is extracted, 
with no indication that it is deleted.  In fact, if a word was deleted and 
replaced by another word, both words just show up side-by-side.  Is there a way 
to get some sort of annotation that indicates the status of the text?  Or 
extract it in some sort of structured (e.g., XML) format?  Similarly for 
highlighted text or other mark-up.  Any way to get that?



For example

[inline image not preserved]



“Time of Essence” was changed to “Time of Importance”



Peter Kronenberg  |  Senior AI Analytic ENGINEER




