[ https://issues.apache.org/jira/browse/TIKA-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839458#action_12839458 ]
Dave Meikle commented on TIKA-385:
----------------------------------

This is the default behaviour in POI - see XWPFHyperlinkDecorator[0]. This currently has a TODO for outputting the link in the correct location, so we could have a go at fixing that in POI, or just implement our own in the meantime.

Does anyone know of a reason why it would just be <link_url> in the output? For internal links maybe?

Cheers,
Dave

[0] http://poi.apache.org/apidocs/org/apache/poi/xwpf/model/XWPFHyperlinkDecorator.html

> Incorrect handling of hyperlinks in .docx
> -----------------------------------------
>
>                 Key: TIKA-385
>                 URL: https://issues.apache.org/jira/browse/TIKA-385
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>         Environment: Linux, java version "1.6.0_17", Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
>            Reporter: Liam O'Boyle
>         Attachments: Internal_Search_Test.docx
>
>
> Hyperlinks are incorrectly parsed in at least some Office 2007 Word files.
> The attached file is one example.
> There are two problems with the handling:
> - an incorrectly formatted link is generated: instead of
>   <a href="http://somewhere"> you get <http://somewhere>
> - the link is in the incorrect location in the extracted text; the links in
>   the attached document end up at the end of the paragraph that they were
>   originally in the middle of
> Both of these issues cause problems later on when using Tika with Solr:
> - the incorrect links are not picked up by the built-in HTML filter classes
> - the garbled text order creates unhelpful snippets when highlighting

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.