I can see how the encoding might be useful to some people. However, I also agree that older code checking the MIME type returned from Tika for equality (e.g. .equals() or .compareTo() in Java) rather than for containment (e.g. contains()) could run into issues if the dependent code doesn't do extra processing on the MIME type before the check. Since the encoding was never present before, the chances that older code already does processing to grab just the MIME type portion of the returned string are slim, I would assume. Wouldn't it be more backward compatible to just add an "encoding" field to the list of metadata attributes that are returned? ~Scout
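P.S. A rough sketch of the difference I mean, against tika-core (the class name and the hard-coded example value are made up for illustration, not real detection output):

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class CharsetCheckSketch {
        public static void main(String[] args) {
            // Hypothetical value a parser might now report for a plain text file:
            String reported = "text/plain; charset=ISO-8859-1";

            // Older code comparing the raw string for equality now fails:
            System.out.println("text/plain".equals(reported));                   // false

            // Parsing the string and comparing only the base type keeps working,
            // and the charset parameter is still available separately:
            MediaType type = MediaType.parse(reported);
            System.out.println(MediaType.TEXT_PLAIN.equals(type.getBaseType())); // true
            System.out.println(type.getParameters().get("charset"));             // ISO-8859-1

            // The alternative suggested above would expose the charset as its
            // own metadata field instead of appending it to the type string:
            Metadata metadata = new Metadata();
            metadata.set(Metadata.CONTENT_TYPE, "text/plain");
            metadata.set(Metadata.CONTENT_ENCODING, "ISO-8859-1");
        }
    }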
________________________________
From: Public Network Services [mailto:[email protected]]
Sent: Wed 7/25/2012 8:31 AM
To: [email protected]
Subject: Re: Charset detection

If it does not add much to processing, then it could be run earlier, for consistency purposes.

Having said that, I am not sure about the usefulness of appending the charset at the end of the detected MIME type string in the first place. It is correct from a syntax point of view, but it adds one more level of string processing to extract it (as opposed to just getting it from the metadata). Are we sure, for instance, that older code (checking for equality to "text/plain") will not be broken?

Of course the decision has already been made and you guys know very well what you are doing, but it still puzzles me. :-)

On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <[email protected]> wrote:

Hi,

On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services <[email protected]> wrote:
> Should that be the case?

Yes. So far the extra charset detection code is only being run when you actually parse a document, so the charset parameter gets added at that point, not yet at type detection. Perhaps we should run charset detection earlier, already at type detection?

BR,

Jukka Zitting
