bug in CompositeParser.getParser function
-----------------------------------------

                 Key: TIKA-414
                 URL: https://issues.apache.org/jira/browse/TIKA-414
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
            Reporter: Piotr B.

I've upgraded Tika in my project to 0.7. Since then, for many HTML documents AutoDetectParser wrongly chooses the fallback parser instead of HtmlParser.

Example of problematic HTML input:

    <html>
    <head>
    <title>test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>test</body>
    </html>

In this example AutoDetectParser sets Metadata.CONTENT_TYPE to "text/html; charset=utf-8", but no parser is registered for that exact string, so the lookup falls through to the fallback parser.

The solution is to fix the getParser function in CompositeParser so that it ignores content type parameters when looking up a parser (cut the string off from ';' to the end).
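
For illustration, the proposed fix might look roughly like the sketch below. It is a standalone example and not the actual CompositeParser code: the class ContentTypeLookup, its string-keyed registry, and the methods baseType and lookup are all hypothetical names made up for this example. It only shows the idea of dropping everything from ';' onward before the lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parser registry; not Tika's real API.
public class ContentTypeLookup {

    // Maps a bare media type (no parameters) to a parser name.
    private final Map<String, String> parsers = new HashMap<String, String>();
    private final String fallback;

    public ContentTypeLookup(String fallback) {
        this.fallback = fallback;
    }

    public void register(String mediaType, String parserName) {
        parsers.put(mediaType, parserName);
    }

    // Cuts the content type off at the first ';' and trims whitespace,
    // so "text/html; charset=utf-8" becomes "text/html".
    public static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim();
    }

    // Looks up a parser by the bare media type; returns the fallback
    // when nothing is registered for it.
    public String lookup(String contentType) {
        String base = baseType(contentType);
        String parser = (base == null) ? null : parsers.get(base);
        return parser != null ? parser : fallback;
    }

    public static void main(String[] args) {
        ContentTypeLookup lookup = new ContentTypeLookup("fallback");
        lookup.register("text/html", "html");
        System.out.println(lookup.lookup("text/html; charset=utf-8")); // prints "html"
        System.out.println(lookup.lookup("application/x-unknown"));    // prints "fallback"
    }
}
```

With a lookup of this kind, the content type from the example above would resolve to the entry registered for "text/html" instead of falling through to the fallback.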