MS word doc containing tracked changes produces incorrect text
--------------------------------------------------------------

                 Key: TIKA-207
                 URL: https://issues.apache.org/jira/browse/TIKA-207
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.3
         Environment: tika-0.3-standalone.jar
            Reporter: Michael McCandless
            Priority: Minor


Spinoff from this discussion:

  
http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html

When extracting text from an MS Word doc (2003 format) that has
unapproved pending changes, the text from both old and new is glommed
together.

EG I had a doc that contained text "Field.Index.TOKENIZED", and I
changed TOKENIZED to ANALYZED with track changes enabled, and
then when I extract text (using TikaCLI) it produces this:

  Field.Index.TOKENIZEDANALYZED

So, first, it'd be nice to at least get whitespace inserted between
old & new text.

And, second, it'd be great to have an option to control whether it's
old or new text that's indexed (or at least an option to only see
"new" text, ie the current document).

>From the discussion above, it seems like POI may expose the
fine-grained APIs to allow Tika to do this; it's just that Tika's not
leveraging these APIs  for MS Word docs.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to