bug in CompositeParser.getParser function
-----------------------------------------

                 Key: TIKA-414
                 URL: https://issues.apache.org/jira/browse/TIKA-414
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
            Reporter: Piotr B.


I've upgraded Tika in my project to 0.7.
Since then, for many HTML documents AutoDetectParser wrongly chooses the 
fallback parser instead of HtmlParser.

Example of problematic HTML input:

<html>
<head>
<title>test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>test</body>
</html>


In this example AutoDetectParser sets Metadata.CONTENT_TYPE to 
"text/html; charset=utf-8", but no parser is registered for that exact 
string, so the lookup falls through to the fallback parser.

The solution is to fix the getParser function in CompositeParser so that 
it ignores content type parameters (cut the string off from the ';' to 
the end).
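
A minimal sketch of the proposed fix, written as a standalone helper rather 
than as a patch to CompositeParser. The class name ContentTypeLookup, the 
methods stripParameters and lookup, and the plain Map standing in for the 
parser registry are all hypothetical and are not part of the Tika API. 
Trimming whitespace after the cut is an extra assumption beyond the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Tika API.
final class ContentTypeLookup {

    // Returns the content type without parameters: everything before
    // the first ';', with surrounding whitespace trimmed (assumption).
    static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim();
    }

    // Looks up a parser by the bare content type and returns the
    // fallback when nothing is registered. The Map stands in for
    // whatever registry the real getParser consults.
    static <P> P lookup(Map<String, P> parsers, String contentType, P fallback) {
        P parser = parsers.get(stripParameters(contentType));
        return parser != null ? parser : fallback;
    }

    public static void main(String[] args) {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("text/html", "HtmlParser");
        // The value from the report now resolves to the HTML entry.
        System.out.println(
                lookup(parsers, "text/html; charset=utf-8", "fallback"));
    }
}
```

With this lookup, "text/html; charset=utf-8" and "text/html" resolve to the 
same registry entry, while an unregistered type still yields the fallback.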


