sorry,
there is only one value stored under
HttpHeaders.CONTENT_ENCODING, and it is
the right one.
it is content-type that has two values:
Content-Encoding: ISO-8859-1
Content-Language: de
Content-Type: text/html; charset=utf-8
content-type: text/html; charset=ISO-8859-1
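
a minimal sketch of why get() shows only one of several values stored under a name, while a getValues()-style accessor shows all of them. MultiValuedMetadataSketch is a hypothetical stand-in written for illustration, not tika's Metadata class, so it runs without tika on the classpath. with tika itself, Metadata.getValues(name) and Metadata.names() should give the equivalent view.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store (assumption, not Tika code).
class MultiValuedMetadataSketch {
    private final Map<String, String[]> metadata = new LinkedHashMap<>();

    // Appends a value under the given name, keeping any earlier values.
    public void add(String name, String value) {
        String[] old = metadata.get(name);
        if (old == null) {
            metadata.put(name, new String[] { value });
        } else {
            String[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = value;
            metadata.put(name, grown);
        }
    }

    // Mirrors the get() quoted in this thread: first value, or null if absent.
    public String get(String name) {
        String[] values = metadata.get(name);
        return values == null ? null : values[0];
    }

    // Returns a copy of every value stored under the name (empty array if none).
    public String[] getValues(String name) {
        String[] values = metadata.get(name);
        return values == null ? new String[0] : values.clone();
    }

    public static void main(String[] args) {
        MultiValuedMetadataSketch m = new MultiValuedMetadataSketch();
        m.add("Content-Type", "text/html; charset=utf-8");
        m.add("Content-Type", "text/html; charset=ISO-8859-1");

        // get() hides the second value; getValues() exposes both.
        System.out.println("first: " + m.get("Content-Type"));
        for (String value : m.getValues("Content-Type")) {
            System.out.println("all:   " + value);
        }
    }
}
```

names are treated as case-sensitive keys in this sketch, so "Content-Type" and "content-type" would be stored as separate entries.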
best regards
reinhard
reinhard schwab wrote:
> i want to parse an html file that has two encodings in its meta tags.
>
> <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
>
> it seems that tika chooses the right one for text extraction.
> (ISO-8859-1 is the proper encoding)
> but if i want to know which encoding was used, is there any method in the
> API to retrieve it?
> the only method that comes to mind is
>
> Metadata.get(HttpHeaders.CONTENT_ENCODING);
>
> looking at the code, get() returns only the first value.
>
> public String get(final String name) {
>     String[] values = metadata.get(name);
>     if (values == null) {
>         return null;
>     } else {
>         return values[0];
>     }
> }
>
> best regards
> reinhard
>
>
>