Re: Charset detection

Jukka Zitting Wed, 25 Jul 2012 05:46:03 -0700

Hi,

On Wed, Jul 25, 2012 at 2:31 PM, Public Network Services
<[email protected]> wrote:
> Having said that, I am not sure about the usefulness of appending the
> charset at the end of the detected MIME type string in the first place. It
> is correct from a syntax point, but it adds one more level of string
> processing to extract it (as opposed to just getting it from the metadata).
> Are we sure, for instance, that older code (checking for equality to
> "text/plain") will not be broken?


That was part of the thinking behind doing the charset detection, for
now, only once a document is actually being parsed rather than already
at type detection time. It's also why the change was described in so
much detail in CHANGES.txt.

In general I'd recommend people dealing with media types to move away
from basic string matching to using the MediaType and
MediaTypeRegistry classes. That way code that for example checks the
type detection result against something like "text/plain" won't start
failing with a Tika version that might decide to qualify the type with
"text/plain; charset=UTF-8" or to return a more detailed media type
like "text/x-java-source".
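
To illustrate the difference, here is a minimal, dependency-free sketch
of the idea. It is not Tika code: the MediaTypeMatch class and its
one-entry supertype table are hypothetical stand-ins for what MediaType
(parameter handling) and MediaTypeRegistry (supertype lookup) provide.
In real code I'd expect the equivalent to go through MediaType.parse()
and the registry instead of a hand-written helper like this.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika. It mimics the two things the
// MediaType and MediaTypeRegistry classes take care of for you.
final class MediaTypeMatch {

    // Assumed, hand-written supertype table for illustration only;
    // a real registry knows about far more types.
    private static final Map<String, String> SUPERTYPES =
            Map.of("text/x-java-source", "text/plain");

    // Strip parameters such as "; charset=UTF-8" and normalize case.
    static String baseType(String mediaType) {
        int semicolon = mediaType.indexOf(';');
        String base = semicolon >= 0
                ? mediaType.substring(0, semicolon)
                : mediaType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True if 'detected' is 'expected' itself or a specialization of it.
    static boolean isInstanceOf(String detected, String expected) {
        String target = baseType(expected);
        // Walk up the supertype chain until it ends or we find a match.
        for (String t = baseType(detected); t != null; t = SUPERTYPES.get(t)) {
            if (t.equals(target)) {
                return true;
            }
        }
        return false;
    }
}
```

With this, a check like "text/plain; charset=UTF-8".equals("text/plain")
fails, while MediaTypeMatch.isInstanceOf("text/plain; charset=UTF-8",
"text/plain") and MediaTypeMatch.isInstanceOf("text/x-java-source",
"text/plain") both hold.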

BR,

Jukka Zitting
