[jira] Updated: (TIKA-385) Incorrect handling of hyperlinks in .docx

Liam O'Boyle (JIRA) Fri, 26 Feb 2010 21:50:31 -0800

     [ 
https://issues.apache.org/jira/browse/TIKA-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Liam O'Boyle updated TIKA-385:
------------------------------

    Description: 
Hyperlinks are incorrectly parsed in at least some Office 2007 Word files. The
attached file is one example.

There are two problems with the handling:
 - an incorrectly formatted link is generated: instead of
<a href="http://somewhere"> you get <http://somewhere>
 - the link is in the wrong location in the extracted text: links that were
originally in the middle of a paragraph in the attached document end up at
the end of that paragraph

Both of these issues cause problems later on when using Tika with Solr:
 - the incorrect links are not picked up by the built-in HTML filter classes
 - the garbled text order creates unhelpful snippets when highlighting

  was:
Hyperlinks are incorrectly parsed in at least some office 2007 word files.  The 
attached file is one example. 

There are two problems with the handling
 - an incorrectly formatted link is generated, instead of
<a href="http://somewhere"> you get <http://somewhere">
 - the link is in the incorrect location in the extracted text; the links in 
the attached document end up at the end of the paragraph that they were 
originally in the middle of

Both of these issues cause problems later on when using Tika with Solr
 - the incorrect links are not picked up by the built in HTML filter classes
 - the garbled text order creates unhelpful snippets when highlighting


> Incorrect handling of hyperlinks in .docx
> -----------------------------------------
>
>                 Key: TIKA-385
>                 URL: https://issues.apache.org/jira/browse/TIKA-385
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>         Environment: Linux, java version "1.6.0_17", Java(TM) SE Runtime 
> Environment (build 1.6.0_17-b04)
>            Reporter: Liam O'Boyle
>         Attachments: Internal_Search_Test.docx
>
>
> Hyperlinks are incorrectly parsed in at least some Office 2007 Word files.
> The attached file is one example.
> There are two problems with the handling:
>  - an incorrectly formatted link is generated: instead of
> <a href="http://somewhere"> you get <http://somewhere>
>  - the link is in the wrong location in the extracted text: links that were
> originally in the middle of a paragraph in the attached document end up at
> the end of that paragraph
> Both of these issues cause problems later on when using Tika with Solr:
>  - the incorrect links are not picked up by the built-in HTML filter classes
>  - the garbled text order creates unhelpful snippets when highlighting
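
To illustrate why the malformed link form goes unnoticed downstream, here is a
minimal sketch in plain Java. It is not Solr's filter code and does not call
Tika; the class name LinkCheck, the method extractHrefs, and both sample
strings are hypothetical, made up only to mirror the two link forms described
in the report. A simple anchor-tag matcher stands in for an HTML-aware filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or Solr.
public class LinkCheck {
    // Deliberately simplistic: captures the href value of an <a ...> start tag.
    private static final Pattern ANCHOR_HREF =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the href values of all anchor start tags found in the markup. */
    public static List<String> extractHrefs(String markup) {
        List<String> hrefs = new ArrayList<String>();
        Matcher m = ANCHOR_HREF.matcher(markup);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Assumed sample text: link as a proper anchor, in the middle of the paragraph.
        String expected =
            "<p>See <a href=\"http://somewhere\">the link</a> for details.</p>";
        // Assumed sample text: link as a bare <http://...> token at the end of the paragraph.
        String reported =
            "<p>See the link for details. <http://somewhere></p>";

        System.out.println("expected form: " + extractHrefs(expected));
        System.out.println("reported form: " + extractHrefs(reported));
    }
}
```

With the anchor form the URL is found; with the bare <http://somewhere> form
the matcher finds no link, which is consistent with the Solr symptom described
above.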
