How does Tika determine the encoding of HTML files?
I'm asking because I have to parse files that declare the wrong encoding in
the HTML header.

Example:
http://www.brz.gv.at/Portal.Node/brz/public/content/aktuelles/pressemeldungen/41263.html

<meta http-equiv="content-type" content="application/xhtml+xml;
charset=iso-8859-1" />

In reality, the encoding is UTF-8.
This is also the encoding given in the HTTP response header.
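
One way to confirm that the bytes of such a page really are UTF-8 is a strict decode with the JDK. This is only a minimal sketch: the names Utf8Check and isValidUtf8 are made up for illustration and are not part of Tika. Note that the check is one-sided, since any byte sequence is also decodable as ISO-8859-1, and pure ASCII content passes for both.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
final class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // A byte sequence that is not legal UTF-8 was found.
            return false;
        }
    }
}
```

A page with umlauts that is really ISO-8859-1 would normally fail this check, because a lone byte such as 0xE4 is not a legal UTF-8 sequence.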

Looking at the code in HtmlParser,
the method
private String getEncoding(InputStream stream, Metadata metadata)
tries to identify the encoding by checking the meta tags.
If it finds an encoding there, it returns that encoding.

I set the content type and the content encoding in the Tika metadata to
bias the HtmlParser, but these values appear to be ignored at first.
They are only used later, when no encoding is found in the meta tags.

So how will Tika handle such situations in the future, when
a/ the encoding in the meta tag is wrong, and
b/ the encoding in the HTTP response header is correct and differs from the
one in the meta tag?
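
To make the behaviour I am hoping for concrete, here is a rough sketch in plain Java of a lookup order that trusts the HTTP header first and only then falls back to the meta tag. It does not use or describe Tika's actual code; CharsetPreference, extractCharset and choose are hypothetical names, the regular expression is a simplification, and the fallback value is an assumption.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Tika code: prefer the HTTP header over the meta tag.
final class CharsetPreference {

    // Simplified pattern: matches charset=... in a Content-Type value or a meta tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name in upper case, or null if none is declared.
    static String extractCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    // Order of trust: HTTP response header, then meta tag, then the given fallback.
    static String choose(String httpContentType, String htmlHead, String fallback) {
        String fromHeader = extractCharset(httpContentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        String fromMeta = extractCharset(htmlHead);
        return fromMeta != null ? fromMeta : fallback;
    }
}
```

With the example page above, a header of "text/html; charset=UTF-8" would win over the iso-8859-1 declared in the meta tag; the meta tag would only be consulted when the header carries no charset.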

Regards,
Reinhard




