encoding detected by HtmlParser

reinhard schwab Thu, 30 Sep 2010 05:17:42 -0700

How does Tika determine the encoding of HTML files?
I'm asking because I have to parse files that declare the wrong encoding in
the HTML header.


example:
http://www.brz.gv.at/Portal.Node/brz/public/content/aktuelles/pressemeldungen/41263.html

<meta http-equiv="content-type" content="application/xhtml+xml; charset=iso-8859-1" />

In reality, the encoding is UTF-8.
This is also the encoding given in the HTTP response header.

Looking at the code in HtmlParser, the method
private String getEncoding(InputStream stream, Metadata metadata)
tries to identify the encoding by checking the meta tags.
If it finds an encoding there, it returns that encoding.
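
To make the behaviour described above concrete, here is a minimal sketch of
meta-tag charset sniffing. It is not Tika's actual implementation; the class
name MetaCharsetSniffer and the regular expression are assumptions made for
illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT Tika's code: finds the first charset=... inside
// a <meta> tag in the leading bytes of an HTML document.
class MetaCharsetSniffer {
    // Assumed pattern: any <meta ...> tag containing charset=<name>,
    // with or without a quote before the name.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if no meta tag declares one.
    static String sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives
        // regardless of the document's real encoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

A sniffer like this can only report what the document claims. For the page
above it would report iso-8859-1 even though the bytes are UTF-8.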

I set the content type and the content encoding in the Tika metadata to
bias the HtmlParser, but these values seem to be ignored at first.
They are only used later, when no encoding is found in the meta tags.

So how will Tika handle such situations in the future, when
a/ the encoding in the meta tag is wrong
b/ the encoding in the HTTP response header is correct and differs from the one
in the meta tag?
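
One possible policy for this situation, sketched below, is to try the
charset from the HTTP response header first and to accept a candidate only
if the bytes decode under it without errors. This is an assumption about
what could work, not existing Tika behaviour; the class EncodingChooser and
its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical policy, NOT part of Tika: prefer the HTTP header charset,
// fall back to the meta tag charset, and validate each by strict decoding.
class EncodingChooser {
    // True if the bytes decode under the named charset with no malformed
    // or unmappable input. Unknown or null charset names yield false.
    static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException | IllegalArgumentException e) {
            return false;
        }
    }

    // Returns the first candidate that decodes cleanly, or null if neither does.
    static String choose(String httpCharset, String metaCharset, byte[] data) {
        if (httpCharset != null && decodesCleanly(data, httpCharset)) {
            return httpCharset;
        }
        if (metaCharset != null && decodesCleanly(data, metaCharset)) {
            return metaCharset;
        }
        return null;
    }
}
```

The order matters: ISO-8859-1 accepts every byte sequence, so checking the
meta tag first would never reject it, while a strict UTF-8 decode fails on
most non-ASCII text that is not really UTF-8.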

regards
reinhard
