[ 
https://issues.apache.org/jira/browse/TIKA-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839458#action_12839458
 ] 

Dave Meikle commented on TIKA-385:
----------------------------------

This is the default behaviour in POI - see XWPFHyperlinkDecorator[0].  That 
class currently has a TODO for outputting the link in the correct location, so 
we could either have a go at fixing it in POI, or just implement our own in 
the meantime.
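
For illustration, here is a rough sketch of what our own version would need to 
do differently. It is only a simplified model: the Run class and both render 
methods are hypothetical stand-ins, not POI or Tika API. It contrasts the 
reported output (links appended as <url> after the text) with anchors emitted 
at the position of the run they belong to.

```java
import java.util.ArrayList;
import java.util.List;

public class HyperlinkSketch {
    /** Hypothetical stand-in for a paragraph run: text plus an optional link target. */
    public static final class Run {
        public final String text;
        public final String url; // null for plain text

        public Run(String text, String url) {
            this.text = text;
            this.url = url;
        }
    }

    /** Mimics the behaviour reported in the issue: all text first, links appended as <url>. */
    public static String renderAppended(List<Run> runs) {
        StringBuilder text = new StringBuilder();
        StringBuilder links = new StringBuilder();
        for (Run r : runs) {
            text.append(r.text);
            if (r.url != null) {
                links.append(" <").append(r.url).append(">");
            }
        }
        return text.append(links).toString();
    }

    /** Emits each link as an anchor element at the position where its run occurs. */
    public static String renderInline(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run r : runs) {
            if (r.url != null) {
                out.append("<a href=\"").append(escape(r.url)).append("\">")
                   .append(escape(r.text)).append("</a>");
            } else {
                out.append(escape(r.text));
            }
        }
        return out.toString();
    }

    /** Minimal escaping for text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("See ", null));
        runs.add(new Run("the site", "http://somewhere"));
        runs.add(new Run(" for details.", null));
        System.out.println(renderAppended(runs));
        System.out.println(renderInline(runs));
    }
}
```

In a real fix the runs would come from the paragraph being parsed, and the 
anchor would presumably go through the parser's XHTML output rather than a 
string builder.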

Does anyone know why the output would contain just <link_url>?  For internal 
links, maybe?

Cheers,
Dave

[0] 
http://poi.apache.org/apidocs/org/apache/poi/xwpf/model/XWPFHyperlinkDecorator.html

> Incorrect handling of hyperlinks in .docx
> -----------------------------------------
>
>                 Key: TIKA-385
>                 URL: https://issues.apache.org/jira/browse/TIKA-385
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>         Environment: Linux, java version "1.6.0_17", Java(TM) SE Runtime 
> Environment (build 1.6.0_17-b04)
>            Reporter: Liam O'Boyle
>         Attachments: Internal_Search_Test.docx
>
>
> Hyperlinks are incorrectly parsed in at least some Office 2007 Word files.  
> The attached file is one example. 
> There are two problems with the handling:
>  - an incorrectly formatted link is generated: instead of
> <a href="http://somewhere"> you get <http://somewhere>
>  - the link is in the incorrect location in the extracted text; the links in 
> the attached document end up at the end of the paragraph that they were 
> originally in the middle of
> Both of these issues cause problems later on when using Tika with Solr:
>  - the incorrect links are not picked up by the built-in HTML filter classes
>  - the garbled text order creates unhelpful snippets when highlighting

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.