[jira] Resolved: (TIKA-377) Error parsing HTML partial with AutoDetect parser

Jukka Zitting (JIRA) Wed, 10 Feb 2010 08:11:50 -0800

     [ https://issues.apache.org/jira/browse/TIKA-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-377.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7
         Assignee: Jukka Zitting

In cases like this, the Tika type detection code is fooled into thinking that
the document is XML, and any strict (draconian) XML parser will reject such a
document because it is not well-formed XML.
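
The rejection can be reproduced without Tika. The sketch below is only an
illustration and is not taken from the attached test.html: it assumes a
made-up document that meets the reporter's two conditions (a comment before
any tags, more than one top-level tag) and feeds it to the JDK's own SAX
parser. The class name StrictXmlCheck and the helper isWellFormedXml are
hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictXmlCheck {

    /** Returns true if the text parses as well-formed XML with the JDK SAX parser. */
    static boolean isWellFormedXml(String text) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // DefaultHandler rethrows fatal errors, so malformed input ends up
            // in the SAXException branch below.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed stand-in for the attachment: comment first, two top-level tags.
        String partial = "<!-- comment -->\n<p>first</p>\n<p>second</p>\n";
        System.out.println("partial HTML well-formed: " + isWellFormedXml(partial));
        System.out.println("single root well-formed: " + isWellFormedXml("<p>one root</p>"));
    }
}
```

A second top-level element violates XML's single-root rule, so a strict
parser stops with a fatal error instead of recovering the way a lenient HTML
parser would.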

In revisions 908554 and 908560 I added more heuristics to Tika to better
detect such tag-soup HTML. With these changes the attached test document is
correctly recognized as HTML and parsed with the lenient HTML parser.
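
For readers unfamiliar with this kind of sniffing, here is a minimal sketch
of what a tag-soup heuristic might look like. It is NOT the code from
revisions 908554 and 908560, whose details are not given in this message.
The class TagSoupSniffer, the method looksLikeHtml, and the short list of
element names are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TagSoupSniffer {

    // Hypothetical, deliberately short list of common HTML element names.
    private static final Set<String> HTML_NAMES = new HashSet<>(Arrays.asList(
            "html", "head", "body", "title", "div", "p", "span", "table",
            "a", "h1", "h2", "h3", "ul", "ol", "li", "img", "form"));

    /** Guesses whether the start of a document looks like HTML rather than generic XML. */
    static boolean looksLikeHtml(String prefix) {
        int i = 0;
        while (true) {
            // Skip whitespace and any leading comments before the first tag.
            while (i < prefix.length() && Character.isWhitespace(prefix.charAt(i))) {
                i++;
            }
            if (!prefix.startsWith("<!--", i)) {
                break;
            }
            int end = prefix.indexOf("-->", i + 4);
            if (end < 0) {
                return false; // unterminated comment, nothing to inspect
            }
            i = end + 3;
        }
        if (prefix.startsWith("<?xml", i)) {
            return false; // an XML declaration: leave it to the XML path
        }
        if (prefix.regionMatches(true, i, "<!DOCTYPE html", 0, 14)) {
            return true;
        }
        if (i >= prefix.length() || prefix.charAt(i) != '<') {
            return false;
        }
        // Read the first element name and compare it case-insensitively.
        int start = i + 1;
        int j = start;
        while (j < prefix.length() && Character.isLetterOrDigit(prefix.charAt(j))) {
            j++;
        }
        String name = prefix.substring(start, j).toLowerCase(Locale.ROOT);
        return HTML_NAMES.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml("<!-- comment -->\n<p>first</p>\n<p>second</p>"));
        System.out.println(looksLikeHtml("<?xml version=\"1.0\"?><root/>"));
    }
}
```

The point of the sketch is only that looking past a leading comment at the
first element name gives a detector enough evidence to route a partial
document to the lenient HTML parser instead of the XML one.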

> Error parsing HTML partial with AutoDetect parser
> -------------------------------------------------
>
>                 Key: TIKA-377
>                 URL: https://issues.apache.org/jira/browse/TIKA-377
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>            Reporter: Brett S.
>            Assignee: Jukka Zitting
>             Fix For: 0.7
>
>         Attachments: test.html
>
>
> I get the following error parsing an HTML file containing a partial HTML
> document.
> TIKA-237: Illegal SAXException from 
> org.apache.tika.parser.xml.dcxmlpar...@3a43af 
> The following conditions need to exist in the file for the error to be thrown:
> + An HTML comment before any HTML tags
> + More than one top-level HTML tag
> I will attach a test file to reproduce.

