[ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Curtis Warner updated TIKA-405:
-------------------------------

    Attachment: WordDocWithLinksAndTable.doc
                expected.txt
                actual.txt

> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
>                 Key: TIKA-405
>                 URL: https://issues.apache.org/jira/browse/TIKA-405
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: 32-bit Ubuntu Linux
>            Reporter: Curtis Warner
>         Attachments: actual.txt, expected.txt, WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table
> filled with dummy text. KeyView generated the full text, as I expected.
> Aperture and Tika had identical results to one another (barring one lost
> whitespace character), but their outputs yielded significantly fewer tokens
> than KeyView's did. I've attached the output text from KeyView and Tika for
> reference.
> There are two distinct problems I recognized in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They
> appear to have been skipped completely.
> 2) The values in the Word Doc's table are conglomerated all together into a
> single blob rather than being emitted separately, which ruins any attempt at
> tokenizing the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test
> file, my guess is that it's a problem with the shared POI library. I thought
> it would be worth noting, though, in case there's an easy fix on the Tika end
> of things.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira