Hi,
I'm using HtmlParser to parse HTML pages in the wild. Unsurprisingly,
many of them contain serious errors. Is there a way to specify a fallback
encoding when the one declared in <meta..> is invalid (e.g. "ISO 8859-1",
note the space instead of a dash)?
This is what I'm getting now, using Tika 0.7:
java.nio.charset.IllegalCharsetNameException: ISO 8859-1
at java.nio.charset.Charset.checkName(Charset.java:284)
at java.nio.charset.Charset.lookup2(Charset.java:458)
at java.nio.charset.Charset.lookup(Charset.java:437)
at java.nio.charset.Charset.isSupported(Charset.java:479)
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
The relevant portion of the page:
<META http-equiv="Content-Type" content="text/html; charset=ISO 8859-1">
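To illustrate the behaviour I'm after, here is a minimal sketch in plain Java. It is not Tika's API: the class name CharsetFallback and the method resolve are hypothetical, and replacing whitespace with a dash is only my assumption about what a reasonable repair would look like. It relies solely on java.nio.charset.Charset.forName, which throws IllegalCharsetNameException for a syntactically illegal name and UnsupportedCharsetException for an unknown one (both are subclasses of IllegalArgumentException).

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Tika: resolves a declared charset name
// and returns a caller-supplied fallback instead of throwing.
final class CharsetFallback {

    private CharsetFallback() {}

    static Charset resolve(String declared, Charset fallback) {
        if (declared == null) {
            return fallback;
        }
        String name = declared.trim();
        if (name.length() == 0) {
            return fallback;
        }
        try {
            // Succeeds for legal, supported names and aliases.
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            // Assumed repair: turn runs of whitespace into a single dash,
            // so "ISO 8859-1" becomes "ISO-8859-1".
            String repaired = name.replaceAll("\\s+", "-");
            try {
                return Charset.forName(repaired);
            } catch (IllegalArgumentException e2) {
                // Still unusable: give up and use the fallback.
                return fallback;
            }
        }
    }
}
```

With this, CharsetFallback.resolve("ISO 8859-1", Charset.forName("UTF-8")) would give ISO-8859-1, while a completely bogus name would give the UTF-8 fallback. Something equivalent inside HtmlParser.getEncoding, with the fallback configurable, is what I'm looking for.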
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com