Problems handling Hyperlinks and Tables in Word 97 Docs
-------------------------------------------------------

                 Key: TIKA-405
                 URL: https://issues.apache.org/jira/browse/TIKA-405
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.7
         Environment: 32-bit Ubuntu Linux
            Reporter: Curtis Warner


I discovered some odd behavior while running a three-way comparison test 
between Tika, Aperture, and Autonomy KeyView. The input file was a test Word 97 
Doc (attached) containing a paragraph peppered with hyperlinks and a table 
filled with dummy text. KeyView generated the full text, as I expected. 
Aperture and Tika produced identical output (barring one lost whitespace 
character), but that output yielded significantly fewer tokens than KeyView's 
did. I've attached the output text from KeyView and Tika for reference.

I noticed two distinct problems in Tika's text output:

1) Hyperlinks from the Word Doc aren't included in the output text. They appear 
to have been skipped completely.

2) The values in the Word Doc's table are run together into a single blob 
rather than being emitted as separate cells, which ruins any attempt at 
tokenizing the table's contents.
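
To illustrate why the second problem hurts tokenization, here is a minimal, 
self-contained sketch. It does not call Tika or POI; the class name, the 
helper methods, and the sample cell values are all hypothetical and are not 
taken from the attached document. It only shows how a whitespace tokenizer 
sees the same cells when they are emitted with and without separators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Tika, Aperture, or POI.
class TableTokenDemo {

    // Splits text on runs of whitespace, as a simple tokenizer would.
    public static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Joins table cells into one string using the given cell and row separators.
    public static String joinCells(String[][] rows, String cellSep, String rowSep) {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(String.join(cellSep, row));
        }
        return String.join(rowSep, lines);
    }

    public static void main(String[] args) {
        // Made-up table contents, for illustration only.
        String[][] rows = { { "alpha", "beta" }, { "gamma", "delta" } };

        // Cells emitted with no separator, as in the blob described above.
        String blob = joinCells(rows, "", "");
        // Cells emitted with a tab between cells and a newline between rows.
        String separated = joinCells(rows, "\t", "\n");

        System.out.println("blob tokens:      " + tokenize(blob).size());
        System.out.println("separated tokens: " + tokenize(separated).size());
    }
}
```

With these sample values the blob collapses to a single token, while the 
separated form keeps one token per cell. Any whitespace separator between 
cells would be enough to restore the token count.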

Seeing as both Tika and Aperture had exactly the same issues with this test 
file, my guess is that it's a problem with the shared POI library. I thought it 
would be worth noting, though, in case there's an easy fix on the Tika end of 
things.


        
