Hi,

On Sun, Aug 21, 2011 at 12:07 PM, Jakub Liska <[email protected]> wrote:
> MediaType mediaType = MediaType.parse(tika.detect(inputStream));
> String mimeType = mediaType.getSubtype();

No need to parse the media type and pick just a part of it. From your
examples it looks like you're interested in the full type/subtype
string, which you should get simply with:

    String mimeType = tika.detect(inputStream);

The reason why you're getting application/x-tika-msoffice as a result
instead of application/vnd.ms-excel, or application/zip instead of a
more specific OOXML media type, is that the default byte pattern
detection rules in tika-core can only detect the generic OLE2 or ZIP
container format shared by all MS Office document types.
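
To illustrate why byte patterns alone can't get any further, here is a
rough sketch of that kind of check. This is not Tika's actual code, and
the MagicSketch class and its sniff method are names I made up for the
example. Only the two file signatures are real: every legacy Office file
starts with the same OLE2 header, and every OOXML file starts with the
same ZIP header.

```java
import java.util.Arrays;

// Hypothetical illustration only, not part of Tika.
final class MagicSketch {
    // OLE2 compound document signature, shared by legacy .doc, .xls and .ppt files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\003\004", shared by .docx, .xlsx, .pptx and plain .zip.
    private static final byte[] ZIP = { 0x50, 0x4B, 0x03, 0x04 };

    // True if data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns only the generic container type; the leading bytes carry
    // no information about whether the content is Word, Excel, etc.
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "application/x-tika-msoffice";
        }
        if (startsWith(header, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] zipHeader = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipHeader)); // prints application/zip
    }
}
```

A .docx and an .xlsx file would both take the ZIP branch here, which is
why some extra hint (a file name, or a look inside the container) is
needed to tell them apart.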

There are two ways to get a more specific media type from Tika's type
detection mechanism. The first one, since it looks like you have your
test documents as normal files, is simply to give Tika the file itself
instead of just the input stream, so that it can use the file
extension to work out the most likely specific media type, like
this:

    String mimeType = tika.detect(file);

A more complicated but also more accurate mechanism is for Tika to
actually try parsing the generic OLE2 or ZIP container and use the
contained information to determine the more specific media type. To do
this you need at least Tika 0.9 and you need to include not just
tika-core but also tika-parsers and the POI jars in your classpath.
Once you have your classpath set up, Tika will automatically enable
this more accurate type detection mechanism. See TIKA-447 [1] for the
full details. You can test the detection mechanism with the tika-app
jar like this:

    $ java -jar tika-app-0.9.jar --detect test.doc
    $ java -jar tika-app-0.9.jar --detect < test.doc


You also asked about the mimetypes file:

> What is the alias about?

Some media types have multiple widely used aliases. For example
both application/msword and application/vnd.ms-word are widely used to
refer to the old OLE2-based MS Word file format, even though only the
former is officially registered at IANA. The alias settings in the
mimetypes file allow Tika to correctly detect such aliases and to
automatically map them to the official type name.
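
Conceptually the alias handling is just a lookup from alias to official
name. A minimal sketch of the idea, assuming a plain in-memory map (the
AliasSketch class and its canonical method are hypothetical and not how
Tika implements it; the only alias registered is the one from the
example above):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only, not part of Tika.
final class AliasSketch {
    // alias -> official type name
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("application/vnd.ms-word", "application/msword");
    }

    // Returns the official name for a known alias; any other type is
    // returned unchanged apart from trimming and lower-casing.
    static String canonical(String type) {
        String key = type.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(canonical("application/vnd.ms-word")); // prints application/msword
        System.out.println(canonical("application/msword"));      // prints application/msword
    }
}
```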

> And how to get the iana.org mime-type name instead of sub-class-of type name ?

See above.

[1] https://issues.apache.org/jira/browse/TIKA-447

BR,

Jukka Zitting
