Hi Folks,

 Thinking this through more, it probably makes a lot of sense for the
default MIME type in Tika to be application/octet-stream. Essentially, what
this says is: "this is content for which I could not discern a MIME
type". application/octet-stream only indicates that the content is an
arbitrary sequence of bytes, so in effect it is the supertype of all
other MIME types.
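
 As a rough sketch of that fallback behaviour (the class, the method, and the
small extension table below are hypothetical and are not Tika's actual API),
anything that cannot be matched simply resolves to application/octet-stream:

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of Tika: it only illustrates the fallback.
final class FallbackTypeResolver {
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Tiny illustrative table; a real repository would hold many more entries.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "html", "text/html",
            "pdf", "application/pdf");

    // Returns the known type for the file extension, or the default
    // when there is no extension or the extension is unknown.
    static String resolve(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return DEFAULT_TYPE;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, DEFAULT_TYPE);
    }
}
```

 The point is only that "unknown" and "arbitrary bytes" end up as the same
answer, so callers never have to handle a missing type.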

 I like the idea of implementing a UNIX strings-style parser: there was
discussion on the Nutch list a year or so ago regarding this same issue. If
there isn't exact consensus on this, however, we could always make the
default MIME type a settable attribute on the mimeTypeRepository tag in the
tika-config.xml file. That way, we could ship Tika with a default MIME
type of application/octet-stream, and if that doesn't work for users,
they simply update the attribute in their XML file (and perhaps
additionally turn on magic detection), and that would probably solve the
issue.
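
 To make the configuration idea concrete, here is a minimal sketch. It
assumes a hypothetical attribute named defaultMimeType on the
mimeTypeRepository tag; that attribute name and the helper class are
illustrative only and do not exist in Tika today. If the attribute is
absent, the built-in default of application/octet-stream applies:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, NOT part of Tika.
final class DefaultTypeConfig {
    static final String BUILT_IN_DEFAULT = "application/octet-stream";

    // Reads the (assumed) defaultMimeType attribute from the first
    // mimeTypeRepository element; falls back to the built-in default.
    static String readDefaultType(String configXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            configXml.getBytes(StandardCharsets.UTF_8)));
            NodeList repos = doc.getElementsByTagName("mimeTypeRepository");
            if (repos.getLength() == 0) {
                return BUILT_IN_DEFAULT;
            }
            Element repo = (Element) repos.item(0);
            // getAttribute returns "" when the attribute is not present.
            String configured = repo.getAttribute("defaultMimeType").trim();
            return configured.isEmpty() ? BUILT_IN_DEFAULT : configured;
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable config XML", e);
        }
    }
}
```

 A user who prefers, say, text/plain as the fallback would then only need to
set that one attribute in their own copy of the config file.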

 Thoughts?  If you all agree, I'll create an issue about this in JIRA. Come
to think of it: I'll probably do that anyways.


Cheers, 
 Chris



On 10/12/07 6:12 PM, "Keith R. Bennett" <[EMAIL PROTECTED]> wrote:

> 
> By the way, strings detects sequences of ASCII characters, so it would not
> work at all in many locales (unless ICU or someone else has figured out how
> to do this).
> 
> As long as this limitation is documented, however, I think it would still be
> extremely useful, and trivial to implement.
> 
> - Keith
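
 For reference, a strings-style extraction along the lines Keith describes
could look roughly like the sketch below. The class name and the
minimum-length parameter are assumptions made for illustration; the code
only recognises runs of printable ASCII (plus tab), which is exactly the
locale limitation mentioned above:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Tika.
final class AsciiStrings {

    // Collects every run of printable ASCII bytes that is at least
    // minLength bytes long, in the order the runs appear in the data.
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if none
        // Iterate one step past the end so a trailing run is flushed too.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintable(data[i]);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    found.add(new String(data, start, i - start,
                            StandardCharsets.US_ASCII));
                }
                start = -1;
            }
        }
        return found;
    }

    // Printable ASCII range plus tab; bytes >= 0x80 are negative in Java
    // and therefore fail the lower-bound check.
    private static boolean isPrintable(byte b) {
        return (b >= 0x20 && b <= 0x7E) || b == '\t';
    }
}
```

 Calling AsciiStrings.extract(bytes, 4) mirrors the usual four-character
minimum of the UNIX tool; text in non-ASCII encodings is skipped entirely.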

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

