Re: content encoding

reinhard schwab Wed, 18 Aug 2010 02:11:29 -0700

Sorry, I was wrong:
only one value is stored in

HttpHeaders.CONTENT_ENCODING

and it is the right one.
Content-Type is the one with two values:

Content-Encoding: ISO-8859-1
Content-Language: de
Content-Type: text/html; charset=utf-8
content-type: text/html; charset=ISO-8859-1
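
To illustrate why a first-value getter hides the second entry, here is a minimal sketch. MultiValueMetadata is a hypothetical stand-in, not Tika's Metadata class, and it assumes both values are stored under one name (the differently cased names in the dump above may be separate keys in a real store). As far as I know, Tika's own Metadata class offers getValues(String name), which returns a String[] with every value, so that would be the call to try against the real API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store.
// It is NOT Tika's Metadata class; it only mirrors the first-value
// behaviour of the get() method quoted further down in this thread.
public class MultiValueMetadata {
    private final Map<String, List<String>> store = new HashMap<>();

    // Append a value instead of overwriting earlier ones.
    public void add(String name, String value) {
        store.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // First value only, or null if the name is unknown.
    public String get(String name) {
        List<String> values = store.get(name);
        return (values == null) ? null : values.get(0);
    }

    // Every value recorded under the name, in insertion order (empty if none).
    public String[] getValues(String name) {
        List<String> values = store.get(name);
        return (values == null) ? new String[0] : values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MultiValueMetadata md = new MultiValueMetadata();
        // Assumption: both values end up under the same name.
        md.add("Content-Type", "text/html; charset=utf-8");
        md.add("Content-Type", "text/html; charset=ISO-8859-1");

        System.out.println("first: " + md.get("Content-Type"));
        for (String v : md.getValues("Content-Type")) {
            System.out.println("value: " + v);
        }
    }
}
```

With a getter like getValues, the caller can see every recorded charset and decide which one to trust, instead of getting whichever value happened to be stored first.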

best regards
reinhard

reinhard schwab wrote:
> I want to parse an HTML file that declares two encodings in its meta tags.
>
> <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
>
> It seems that Tika chooses the right one for text extraction
> (ISO-8859-1 is the proper encoding).
> But if I want to know the proper encoding, is there any method in the
> API to retrieve it?
> The only method that comes to mind is
>
> Metadata.get(HttpHeaders.CONTENT_ENCODING);
>
> Looking at the code, it returns only the first value:
>
> public String get(final String name) {
>     String[] values = metadata.get(name);
>     if (values == null) {
>         return null;
>     } else {
>         return values[0];
>     }
> }
>
> best regards
> reinhard
>
>
>
