I want to parse an HTML file that declares two encodings in its meta tags:
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
It seems that Tika chooses the right one for text extraction
(ISO-8859-1 is the proper encoding).
But if I want to know which encoding it picked, is there a method in the
API to retrieve it?
The only method that comes to mind is
    Metadata.get(HttpHeaders.CONTENT_ENCODING);
but looking at the code, it simply returns the first stored value:
    public String get(final String name) {
        String[] values = metadata.get(name);
        if (values == null) {
            return null;
        } else {
            return values[0];
        }
    }
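To illustrate the concern, here is a minimal sketch using a stand-in class (MultiValueSketch is hypothetical, not Tika's Metadata) that stores several values per name the same way the quoted get() implies. It shows that get() only ever exposes the first value, while an accessor returning the whole array would expose both declared charsets. The getValues name mirrors what Tika's Metadata appears to offer, but that should be checked against the Javadoc for the version in use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store; this is NOT Tika's class.
public class MultiValueSketch {
    private final Map<String, String[]> metadata = new HashMap<>();

    // Append a value under the given name, keeping any earlier values.
    public void add(String name, String value) {
        String[] old = metadata.get(name);
        if (old == null) {
            metadata.put(name, new String[] { value });
        } else {
            String[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = value;
            metadata.put(name, grown);
        }
    }

    // Same logic as the get() quoted above: only the first value is returned.
    public String get(String name) {
        String[] values = metadata.get(name);
        return values == null ? null : values[0];
    }

    // Returns a copy of every value stored under the name, in insertion order.
    public String[] getValues(String name) {
        String[] values = metadata.get(name);
        return values == null ? new String[0] : values.clone();
    }

    public static void main(String[] args) {
        MultiValueSketch m = new MultiValueSketch();
        // The two charsets from the meta tags, in document order.
        m.add("Content-Encoding", "ISO-8859-1");
        m.add("Content-Encoding", "utf-8");
        System.out.println(m.get("Content-Encoding"));
        System.out.println(String.join(", ", m.getValues("Content-Encoding")));
    }
}
```

With the two values added in document order, get() yields ISO-8859-1 only because it happens to be first. It says nothing about which encoding was actually used for extraction.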
Best regards,
Reinhard