Hi all,

I was trying out my mbox parser via the TikaCLI command line tool.

My parser wasn't getting called - rather, the generic text parser was used.

The problem is that in tika-mimetypes.xml, the application/mbox entry didn't specify that it was a subtype of text/plain.

So even though the name detector code correctly generated application/mbox as the type hint from the .mbox suffix, the hint was ignored because application/mbox wasn't a subtype of text/plain, the type previously derived from the content.
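
To illustrate the behavior described above, here is a rough sketch in Java. It is not Tika's actual code: the class TypeHintSketch, the PARENTS map and both methods are hypothetical names I made up, and the map simply stands in for whatever parent relationships tika-mimetypes.xml declares.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the detection step; not Tika's real implementation.
class TypeHintSketch {
    // Child type -> declared parent type (assumed to come from tika-mimetypes.xml).
    static final Map<String, String> PARENTS = new HashMap<>();

    // True if 'type' equals 'ancestor' or reaches it by following parent links.
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = type; t != null; t = PARENTS.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Keep the name-based hint only when it refines the content-based type.
    static String resolve(String contentType, String nameHint) {
        return isSpecializationOf(nameHint, contentType) ? nameHint : contentType;
    }

    public static void main(String[] args) {
        // No parent declared: the hint is dropped.
        System.out.println(resolve("text/plain", "application/mbox")); // text/plain

        // Parent declared: the hint wins.
        PARENTS.put("application/mbox", "text/plain");
        System.out.println(resolve("text/plain", "application/mbox")); // application/mbox
    }
}
```

The only point of the sketch is that the hint survives or not depending on a single declared parent link.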

Easy enough to fix, but in looking through the tika-mimetypes.xml file I wonder how many other types need similar treatment. For example:

  <mime-type type="application/xspf+xml">
    <glob pattern="*.xspf"/>
  </mime-type>

If I use the TikaCLI with a test.xspf file, the mime-type it derives is application/xml, not application/xspf+xml as expected.

One partial fix here would be to extend the MimeTypes.forName method to check for "+xml" at the end of the type name, similar to how it checks for "text/" at the beginning, and auto-set the parent to application/xml.
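
A minimal sketch of that suffix check, as standalone Java rather than a patch against MimeTypes.forName. The class ParentInferenceSketch and the method inferParent are hypothetical, and I am assuming the existing "text/" check assigns text/plain as the parent; returning null for "no inferred parent" is also just a choice for the sketch.

```java
// Hypothetical helper showing the proposed rule; not part of Tika.
class ParentInferenceSketch {
    // Returns the parent type implied by the name alone, or null if none.
    static String inferParent(String type) {
        // Proposed rule: any "+xml" type is treated as a kind of application/xml.
        if (type.endsWith("+xml")) {
            return "application/xml";
        }
        // Assumed existing rule: other text/* types descend from text/plain.
        if (type.startsWith("text/") && !type.equals("text/plain")) {
            return "text/plain";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(inferParent("application/xspf+xml")); // application/xml
        System.out.println(inferParent("text/csv"));             // text/plain
        System.out.println(inferParent("application/mbox"));     // null
    }
}
```

Note that application/mbox still gets no parent from this rule, so it would still need an explicit entry in tika-mimetypes.xml.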

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
