Hi Keith,

 The system can determine the mime type from the content if the magic
parameter is enabled. With it on, the mime system examines the contents of
the stream for known magic chars. It's turned off by default for efficiency
reasons, because reading the content is slower than simply doing filename
or URL comparisons.

 This parameter is controlled by the attribute "magic" within the
tika-config.xml file. Take a look at the mimeTypeRepository tag, and check
the magic attribute. If set to "true", then magic resolution is done. That
should get rid of the default "application/octet-stream" issue you're
having.
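
 To make the resolution order concrete, here is a rough sketch in Java. It
is not the actual MimeTypes.java code: the class name MimeSketch, the detect
method, and the two small lookup tables are all made up for illustration,
and URL resolution is left out. It only shows the idea of trying the cheap
name-based lookup first, trying magic chars only when the flag is on, and
falling back to application/octet-stream otherwise.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Tika code base.
class MimeSketch {
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Cheap lookup: file extension to mime type (illustrative entries only).
    static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    // Slower lookup: leading "magic" bytes to mime type (illustrative entries only).
    static final Map<byte[], String> BY_MAGIC = new LinkedHashMap<>();

    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("txt", "text/plain");
        BY_MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        BY_MAGIC.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
    }

    // Resolve by extension first, then by magic bytes if enabled, else the default.
    static String detect(String name, byte[] content, boolean magic) {
        int dot = name.lastIndexOf('.');
        if (dot >= 0) {
            String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
            String byName = BY_EXTENSION.get(ext);
            if (byName != null) {
                return byName;
            }
        }
        if (magic) {
            // Compare each known signature against the start of the content.
            for (Map.Entry<byte[], String> entry : BY_MAGIC.entrySet()) {
                byte[] sig = entry.getKey();
                if (content.length >= sig.length
                        && Arrays.equals(sig, Arrays.copyOf(content, sig.length))) {
                    return entry.getValue();
                }
            }
        }
        return DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        // No extension and magic off: falls through to the default type.
        System.out.println(detect("report", pdf, false)); // application/octet-stream
        // Same input with magic on: the content signature is recognized.
        System.out.println(detect("report", pdf, true));  // application/pdf
    }
}
```

 The second call is the case you're hitting: a name with no usable
extension resolves only once magic resolution is switched on.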

 We should also have a look at the default mime types available within the
tika-mimetypes.xml file. We may need to add some more in there. What forms
of content did you test on? Which specific mime types did you see trouble
with? Could you post them to the list? I'll look through them and fill in the
gaps in the tika-mimetypes.xml file.

 Thanks!

Cheers,
  Chris



On 10/11/07 3:18 PM, "Keith R. Bennett" <[EMAIL PROTECTED]> wrote:

> 
> Chris -
> 
> I'm not sure...on the one hand, since Tika is basically a text parsing tool,
> we might want to make plain text the default MIME type.  We couldn't really
> do anything with an octet stream anyway, right?
> 
> On the other hand, we wouldn't want to attempt to parse something that does
> not have text, so a nonparseable MIME type such as octet stream as default
> might make more sense.
> 
> Isn't our framework supposed to determine the MIME type based on the
> content?  Is there perhaps just a configuration or code change that needs to
> be made?  If so, then this is not an issue.
> 
> - Keith
> 
> 
> Chris Mattmann wrote:
>> 
>> Hi Keith,
>> 
>>  The default mime type in TIKA is application/octet-stream. It gets set
>> when the mime type can't be determined using 3 main means (url resolution,
>> extension resolution, or magic chars). This is in the MimeTypes.java file
>> within the mime package. The reason no parser gets called is because there
>> is no parser registered to handle that mime type.
>> 
>>  Are you suggesting that there is another, more sensible default?
>> 
>> Thanks!
>> 
>> Cheers,
>>   Chris
>> 
>> 

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

