Actually, I am surprised that more people are not already shouting about
this.

All the detect() methods of the Tika convenience class return the MIME type
as a String and, even if they are not the recommended approach, they are
certainly very popular.

I have always been puzzled as to why the return type of such methods should
be String, as opposed to a MimeType object.

Tika is an excellent piece of work and all the contributors are to be
congratulated, but, with all due respect, it seems that this change to the
String returned for "text/plain" will cause numerous headaches.

Perhaps you should issue a recommendation that people use the MimeType or
MediaType classes, even if that means creating such objects by parsing the
String that Tika.detect() returns. Or do something like

MediaType type = TikaConfig.getDefaultConfig().getDetector().detect(...);
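
(A rough, untested sketch of what I mean, assuming the standard Detector and
MediaType APIs; the class name is just for illustration. getBaseType() strips
any parameters, so the comparison keeps working even if a charset is ever
appended.)

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class DetectAsMediaType {
    public static void main(String[] args) throws IOException {
        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        InputStream stream = TikaInputStream.get(new File(args[0]));
        try {
            MediaType type = detector.detect(stream, new Metadata());
            // getBaseType() drops parameters such as "; charset=UTF-8",
            // so this check is unaffected by any appended charset
            if (MediaType.TEXT_PLAIN.equals(type.getBaseType())) {
                System.out.println("Plain text, full type: " + type);
            }
        } finally {
            stream.close();
        }
    }
}

Code that already has the detect() String in hand could presumably call
MediaType.parse(s).getBaseType() to the same effect.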


:-)


On Wed, Jul 25, 2012 at 3:50 PM, Paulini, Matthew CTR USAF AFMC AFRL/RISA <
[email protected]> wrote:

> I can see how the encoding might be useful to some people. However, I also
> agree that older code that checks the MIME type returned from Tika for
> equality (e.g. .equals() or .compareTo() in Java) rather than containment
> (e.g. contains() in Java) could run into issues if the dependent code
> doesn't do extra processing on the MIME type before the check. Since the
> encoding was never present before, the chances that older code would have
> done processing to grab just the MIME type portion of the returned string
> are slim, I would assume.
>
> Wouldn't it be more backward compatible if you just added an "encoding"
> field to the list of metadata attributes that are returned?
>
> ~Scout
>
> ________________________________
>
> From: Public Network Services [mailto:[email protected]]
> Sent: Wed 7/25/2012 8:31 AM
> To: [email protected]
> Subject: Re: Charset detection
>
>
> If it does not add much to the processing, then it could be run earlier,
> for consistency purposes.
>
> Having said that, I am not sure about the usefulness of appending the
> charset to the end of the detected MIME type string in the first place. It
> is correct from a syntax point of view, but it adds one more level of
> string processing to extract it (as opposed to just getting it from the
> metadata). Are we sure, for instance, that older code (checking for
> equality to "text/plain") will not be broken?
>
> Of course the decision has already been made and you guys know very well
> what you are doing, but it still puzzles me. :-)
>
>
> On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <[email protected]>
> wrote:
>
>
>         Hi,
>
>
>         On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services
>         <[email protected]> wrote:
>         > Should that be the case?
>
>
>         Yes. So far the extra charset detection code is only being run when
>         you actually parse a document, so the charset parameter gets added
>         at that point, not yet at type detection. Perhaps we should run
>         charset detection already earlier at that point?
>
>         BR,
>
>         Jukka Zitting
>
