I can see how the encoding might be useful to some people. However, I also agree that older code checking the MIME type returned from Tika for equality (e.g. .equals() or .compareTo() in Java) rather than for containment (e.g. contains()) could run into issues if the dependent code doesn't do extra processing on the MIME type before the check. Since the encoding was never present before, the chances that older code already does processing to grab just the MIME type portion of the returned string are slim, I would assume. Wouldn't it be more backward compatible to just add an "encoding" field to the list of metadata attributes that are returned? ~Scout
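P.S. A rough sketch of the difference I mean, against tika-core (the class name and the hard-coded example value are made up for illustration, not real detection output):

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class CharsetCheckSketch {
        public static void main(String[] args) {
            // Hypothetical value a parser might now report for a plain text file:
            String reported = "text/plain; charset=ISO-8859-1";

            // Older code comparing the raw string for equality now fails:
            System.out.println("text/plain".equals(reported));                   // false

            // Parsing the string and comparing only the base type keeps working,
            // and the charset parameter is still available separately:
            MediaType type = MediaType.parse(reported);
            System.out.println(MediaType.TEXT_PLAIN.equals(type.getBaseType())); // true
            System.out.println(type.getParameters().get("charset"));             // ISO-8859-1

            // The alternative suggested above would expose the charset as its
            // own metadata field instead of appending it to the type string:
            Metadata metadata = new Metadata();
            metadata.set(Metadata.CONTENT_TYPE, "text/plain");
            metadata.set(Metadata.CONTENT_ENCODING, "ISO-8859-1");
        }
    }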
________________________________
From: Public Network Services [mailto:[email protected]]
Sent: Wed 7/25/2012 8:31 AM
To: [email protected]
Subject: Re: Charset detection

If it does not add much to processing, then it could be run earlier, for consistency purposes.

Having said that, I am not sure about the usefulness of appending the charset at the end of the detected MIME type string in the first place. It is correct from a syntax point of view, but it adds one more level of string processing to extract it (as opposed to just getting it from the metadata). Are we sure, for instance, that older code (checking for equality to "text/plain") will not be broken?

Of course the decision has already been made and you guys know very well what you are doing, but it still puzzles me. :-)

On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <[email protected]> wrote:

Hi,

On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services <[email protected]> wrote:
> Should that be the case?

Yes. So far the extra charset detection code is only being run when you actually parse a document, so the charset parameter gets added at that point, not yet at type detection. Perhaps we should run charset detection earlier, already at type detection?

BR,

Jukka Zitting
