content encoding

reinhard schwab Wed, 18 Aug 2010 02:02:14 -0700

I want to parse an HTML file that declares two different encodings in its meta tags.

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >


It seems that Tika chooses the right one for text extraction
(ISO-8859-1 is the proper encoding).
But if I want to know which encoding was actually used, is there any
method in the API to retrieve it?
The only method I can think of is

Metadata.get(HttpHeaders.CONTENT_ENCODING);

Looking at the code, get() returns only the first value stored under that name:

    public String get(final String name) {
        String[] values = metadata.get(name);
        if (values == null) {
            return null;
        } else {
            return values[0];
        }
    }
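
To make the concern concrete, here is a minimal sketch of a multi-valued store with the same get() behaviour as the code quoted above. MultiValuedMetadata is a hypothetical stand-in written for illustration, not Tika's own class, and the "Content-Encoding" key and the order of the two values are assumptions taken from the meta tags in this message. Tika's Metadata class does offer getValues(String name), which returns every value stored under a name; whether both meta declarations actually end up under this key is something I have not verified.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store; not Tika's class.
final class MultiValuedMetadata {
    private final Map<String, String[]> metadata = new HashMap<>();

    // Append a value under the given name, keeping any earlier values.
    public void add(String name, String value) {
        String[] old = metadata.get(name);
        if (old == null) {
            metadata.put(name, new String[] { value });
        } else {
            String[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = value;
            metadata.put(name, grown);
        }
    }

    // Same behaviour as the get() quoted above: only the first value.
    public String get(String name) {
        String[] values = metadata.get(name);
        return values == null ? null : values[0];
    }

    // All values in insertion order; an empty array if the name is unknown.
    public String[] getValues(String name) {
        String[] values = metadata.get(name);
        return values == null ? new String[0] : values.clone();
    }

    public static void main(String[] args) {
        MultiValuedMetadata m = new MultiValuedMetadata();
        // Assumed order: the same order as the two meta tags in the file.
        m.add("Content-Encoding", "ISO-8859-1");
        m.add("Content-Encoding", "utf-8");
        System.out.println(m.get("Content-Encoding"));
        System.out.println(Arrays.toString(m.getValues("Content-Encoding")));
    }
}
```

With a getValues()-style accessor the caller at least sees that two encodings were declared, but it still does not say which one the parser used to decode the text.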

best regards
reinhard
