Hi Keith,

The system can determine the mime type from the content if the magic parameter is enabled. With it on, the mime system uses magic chars, examining the contents of the stream to determine its mime type. For efficiency reasons, though, this is typically turned off by default, because it is slower than simply comparing filenames or URLs.
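To make the trade-off concrete, here is a rough sketch of the idea in plain Java. It is not Tika code: the class MimeSketch, the detect method, the two-entry extension table, and the order of the checks are all assumptions made up for illustration. The only outside fact it relies on is that PDF files begin with the bytes "%PDF-".

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Tika.
class MimeSketch {
    // Fallback used when nothing else matches.
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Tiny extension table (assumed entries, just enough for the example).
    static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "txt", "text/plain");

    // PDF files start with these bytes.
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Cheap name check first; content check only when magic is enabled.
    static String detect(String name, byte[] head, boolean magic) {
        int dot = name.lastIndexOf('.');
        if (dot >= 0 && dot < name.length() - 1) {
            String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
            String byName = BY_EXTENSION.get(ext);
            if (byName != null) {
                return byName;
            }
        }
        if (magic && startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        return DEFAULT_TYPE;
    }
}
```

Under these assumptions, a PDF stream named "report" (no extension) comes back as application/octet-stream when magic is false and as application/pdf when magic is true. The name lookup never touches the stream, which is why it is the cheaper path.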
This parameter is controlled by the attribute "magic" within the tika-config.xml file. Take a look at the mimeTypeRepository tag, and check the magic attribute. If set to "true", then magic resolution is done. That should get rid of the default "application/octet-stream" issue you're having.

We should also have a look at the default mime types available within the tika-mimetypes.xml file. We may need to add some more in there. What forms of content did you test on? Which specific mime types did you see trouble with? Could you post them to the list? I'll look through them and add in the gaps to the tika-mimetypes.xml file.

Thanks!

Cheers,
Chris

On 10/11/07 3:18 PM, "Keith R. Bennett" <[EMAIL PROTECTED]> wrote:

> Chris -
>
> I'm not sure...on the one hand, since Tika is basically a text parsing tool,
> we might want to make plain text the default MIME type. We couldn't really
> do anything with an octet stream anyway, right?
>
> On the other hand, we wouldn't want to attempt to parse something that does
> not have text, so a nonparseable MIME type such as octet stream as default
> might make more sense.
>
> Isn't our framework supposed to determine the MIME type based on the
> content? Is there perhaps just a configuration or code change that needs to
> be made? If so, then this is not an issue.
>
> - Keith
>
>
> Chris Mattmann wrote:
>>
>> Hi Keith,
>>
>> The default mime type in TIKA is application/octet-stream. It gets set
>> when the mime type can't be determined using 3 main means (url resolution,
>> extension resolution, or magic chars). This is in the MimeTypes.java file
>> within the mime package. The reason no parser gets called is because there
>> is no parser registered to handle that mime type.
>>
>> Are you suggesting that there is another, more sensible default?
>>
>> Thanks!
>>
>> Cheers,
>> Chris
>>

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory
Pasadena, CA
Office: 171-266B
Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
