Re: Charset detection

Public Network Services Wed, 25 Jul 2012 15:11:02 -0700

Of course the return type is MediaType, i.e.

MediaType type = TikaConfig.getDefaultConfig().getDetector().detect(...);
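
For code that still receives the type as a String, here is a minimal sketch of the comparison problem discussed in this thread, using only the JDK. The class BaseTypeSketch and its baseType() helper are hypothetical names for illustration, not part of Tika, and the sample string assumes the charset is appended in the usual "type/subtype; charset=..." form. If I recall correctly, MediaType.getBaseType() covers the same need once you hold a MediaType object.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the Tika API.
final class BaseTypeSketch {

    // Returns the "type/subtype" portion of a MIME string, dropping any
    // parameters such as "; charset=UTF-8".
    static String baseType(String mime) {
        int semicolon = mime.indexOf(';');
        String base = (semicolon < 0) ? mime : mime.substring(0, semicolon);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Assumed example of a detected type with an appended charset.
        String detected = "text/plain; charset=ISO-8859-1";

        // A plain equality check no longer matches once a parameter is present.
        System.out.println(detected.equals("text/plain"));           // false

        // Comparing only the base type keeps older checks working.
        System.out.println(baseType(detected).equals("text/plain")); // true
    }
}
```

Comparing the base type is stricter than a contains() check, which would also match unrelated strings that merely include "text/plain".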




On Thu, Jul 26, 2012 at 1:03 AM, Public Network Services <
[email protected]> wrote:

> Actually, I am surprised that many people are not shouting about this
> already.
>
> All the static detect() methods of the Tika convenience class return the
> mime type as a String and, if not the recommended approach, they are
> certainly very popular.
>
> I have always been puzzled as to why the return type of such methods
> should be String, as opposed to a MimeType object.
>
> Tika is excellent work and all the contributors are to be
> congratulated, but, with all due respect, it seems that this modification of
> the return String for "text/plain" will cause numerous headaches.
>
> Perhaps you should issue a directive that people should use the MimeType
> class, even if by creating such objects by parsing the String that
> Tika.detect() returns. Or, do something like
>
> MimeType type = TikaConfig.getDefaultConfig().getDetector().detect(...);
>
>
> :-)
>
>
> On Wed, Jul 25, 2012 at 3:50 PM, Paulini, Matthew CTR USAF AFMC AFRL/RISA
> <[email protected]> wrote:
>
>> I can see how the encoding might be useful to some people. However, I
>> also agree that older code that checks the MIME type returned from Tika
>> for equality (e.g. .equals() or .compareTo() in Java) rather than for
>> containment (e.g. contains() in Java) could run into issues if the
>> dependent code doesn't do extra processing on the MIME string before the
>> check. Since the encoding was never present before, the chances that
>> older code would have done processing to grab just the MIME type portion
>> of the returned string are slim, I would assume.
>>
>> Wouldn't it be more backward compatible if you just added an "encoding"
>> field to the list of metadata attributes that are returned?
>>
>> ~Scout
>>
>> ________________________________
>>
>> From: Public Network Services [mailto:[email protected]]
>> Sent: Wed 7/25/2012 8:31 AM
>> To: [email protected]
>> Subject: Re: Charset detection
>>
>>
>> If it does not add much to processing, then it could be run earlier, for
>> consistency purposes.
>>
>> Having said that, I am not sure about the usefulness of appending the
>> charset at the end of the detected MIME type string in the first place. It
>> is syntactically correct, but it adds one more level of string
>> processing to extract it (as opposed to just getting it from the metadata).
>> Are we sure, for instance, that older code (checking for equality to
>> "text/plain") will not be broken?
>>
>> Of course the decision has already been made and you guys know very well
>> what you are doing, but it still puzzles me. :-)
>>
>>
>> On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <[email protected]>
>> wrote:
>>
>>
>>         Hi,
>>
>>
>>         On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services
>>         <[email protected]> wrote:
>>         > Should that be the case?
>>
>>
>>         Yes. So far the extra charset detection code is only being run when
>>         you actually parse a document, so the charset parameter gets added at
>>         that point, not yet at type detection. Perhaps we should run charset
>>         detection already earlier at that point?
>>
>>         BR,
>>
>>         Jukka Zitting
>>
>>
>>
>>
>
