colin created TIKA-1615:
---------------------------

             Summary: Html fragments with comments before div elements are not 
being detected as html
                 Key: TIKA-1615
                 URL: https://issues.apache.org/jira/browse/TIKA-1615
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 1.7
            Reporter: colin


We are trying to import html fragments into Solr.

The following fragment is not being detected as HTML:

<!-- test -->
<div>
 test
</div>

When the comment is removed, the fragment is parsed as HTML. That 
functionality was added by https://issues.apache.org/jira/browse/TIKA-1102

To work around this, we added 

<root-XML localName="div"/>
<root-XML localName="DIV"/>

to the <mime-type type="text/html"> element in tika-mimetypes.xml

The fragment is then parsed as expected
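
For reference, the workaround sits in tika-mimetypes.xml roughly as sketched
here. This is only an illustration: the existing children of the text/html
entry are elided and assumed to be left unchanged, and their exact content
may differ between Tika versions. Only the two root-XML lines are our change.

```xml
<!-- Sketch only: existing children of the text/html entry are elided. -->
<mime-type type="text/html">
  <!-- ... existing child elements left unchanged ... -->

  <!-- Added as a workaround so that a fragment whose first element
       is a div is treated as HTML even when a comment precedes it. -->
  <root-XML localName="div"/>
  <root-XML localName="DIV"/>
</mime-type>
```

This only covers fragments whose first element is a div. Fragments starting
with other elements would presumably need their own entries.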
