Hi,

On Wed, Jul 25, 2012 at 2:31 PM, Public Network Services <[email protected]> wrote:
> Having said that, I am not sure about the usefulness of appending the
> charset at the end of the detected MIME type string in the first place. It
> is correct from a syntax point, but it adds one more level of string
> processing to extract it (as opposed to just getting it from the metadata).
> Are we sure, for instance, that older code (checking for equality to
> "text/plain") will not be broken?

That was part of the reasoning behind doing charset detection, for now, only when a document is actually being parsed rather than at type detection time. It's also why the change was described in so much detail in CHANGES.txt.

In general I'd recommend that people dealing with media types move away from basic string matching and use the MediaType and MediaTypeRegistry classes instead. That way, code that checks the type detection result against something like "text/plain" won't start failing with a Tika version that decides to qualify the type as "text/plain; charset=UTF-8" or to return a more detailed media type like "text/x-java-source".

BR,

Jukka Zitting
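
To illustrate why plain string equality is fragile here, the following is a minimal, dependency-free sketch. It is not Tika's API: the class `MediaTypeCheck` and its methods `baseType` and `hasBaseType` are hypothetical names invented for this example, and they only approximate what a real media type class does when it separates the base type from parameters such as `charset`.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: compares media types by base type
// instead of by raw string equality.
final class MediaTypeCheck {

    /** Returns the "type/subtype" part, lower-cased, with any parameters removed. */
    static String baseType(String mediaType) {
        int semicolon = mediaType.indexOf(';');
        String base = semicolon >= 0 ? mediaType.substring(0, semicolon) : mediaType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True if both strings name the same base type, ignoring parameters and case. */
    static boolean hasBaseType(String detected, String expected) {
        return baseType(detected).equals(baseType(expected));
    }

    public static void main(String[] args) {
        String detected = "text/plain; charset=UTF-8";

        // Naive check: fails once the detector appends a charset parameter.
        System.out.println(detected.equals("text/plain"));

        // Base-type check: unaffected by the added parameter.
        System.out.println(hasBaseType(detected, "text/plain"));
    }
}
```

This sketch only covers the parameter case. Recognizing that a more detailed type such as "text/x-java-source" is still a kind of "text/plain" needs knowledge of the type hierarchy, which is the role the message assigns to MediaTypeRegistry.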
