Sorry, distracted....  You're calling Tika programmatically.

The mime types _should_ show up in the metadata during the parse.  Let me
confirm that, though.
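
If you want to check this yourself in the meantime, here is a minimal
sketch. It assumes tika-core is on the classpath; the class and method
names are made up for illustration. It parses a file through the Tika
facade and reads back whatever Content-Type ended up in the Metadata
object. Note that no file name is passed here, so with only tika-core a
.doc would probably still come back as the generic OLE2 type.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper, not part of Tika.
final class ParseContentType {

    /** Parses the file and returns the Content-Type found in the metadata afterwards. */
    static String contentTypeAfterParse(Path path) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(path)) {
            // The extracted text is ignored; only the metadata side effect matters here.
            tika.parseToString(stream, metadata);
        }
        return metadata.get(Metadata.CONTENT_TYPE);
    }
}
```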

The MSOffice OLE2 file types are detected precisely by Apache POI if
you have tika-parsers on your classpath.  If you only have tika-core, then
Tika can't tell the difference between OLE2 file formats from the bytes
alone and gives you application/x-tika-msoffice.

Again, if you only have tika-core and you submit a file, Tika will detect
OLE2 from the bytes and then rely on the file name suffix to identify the
specific OLE2 file type, e.g. .doc.  If you want to use an InputStream, you
can send in the file name via the metadata, and you'll get the same result
as if you use a File.
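
A rough sketch of the stream-plus-name approach, assuming tika-core on
the classpath (class and method names are hypothetical). It uses the
Tika.detect(InputStream, String) overload, which takes the file name as
a hint, so you don't have to build the Metadata object by hand:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;

// Hypothetical helper, not part of Tika.
final class NameHintDetect {

    /** Detects the media type from the stream bytes plus the file name as a hint. */
    static String detect(Path path) throws IOException {
        Tika tika = new Tika();
        // Buffered so the detector can mark/reset while it peeks at the leading bytes.
        try (InputStream stream = new BufferedInputStream(Files.newInputStream(path))) {
            return tika.detect(stream, path.getFileName().toString());
        }
    }
}
```

With a .doc file, I'd expect this to give the same answer as
Tika.detect(File), since the detector sees both the bytes and the suffix.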

I _highly_ recommend using TikaInputStream.get(File f, Metadata metadata)
if you can.  This has some efficiency benefits, and TikaInputStream will
set the file name so you'll get the precise file type.
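
Something along these lines, as a sketch only (tika-core assumed on the
classpath; the wrapper class is hypothetical). TikaInputStream.get fills
in the file name on the Metadata object, and that same Metadata is then
handed to detect:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper, not part of Tika.
final class TikaInputStreamDetect {

    /** Detects the media type via a TikaInputStream that records the file name in the metadata. */
    static String detect(File file) throws IOException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        try (InputStream stream = TikaInputStream.get(file, metadata)) {
            // The metadata now carries the file name, so the suffix can refine the result.
            return tika.detect(stream, metadata);
        }
    }
}
```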

On Tue, Dec 22, 2020 at 4:07 PM Tim Allison <[email protected]> wrote:

> Hi Peter, Are you using tika-app, tika-server or something programmatic?
>
> On Tue, Dec 22, 2020 at 2:21 PM Peter Kronenberg <
> [email protected]> wrote:
>
>> Hi, I just started playing with Tika and I have a few questions
>>
>>
>>
>> I’m trying to detect the mimetype of a file using both
>>
>>
>>
>> Tika.detect(InputStream)
>>
>> and
>>
>> Tika.detect(File)
>>
>>
>>
>> I get 2 different results.  I’m testing with a Microsoft Word (.doc) file.
>>
>>
>>
>> As a stream, I get application/x-tika-msoffice.  As a file I get
>> application/msword
>>
>>
>>
>> Why are they different?
>>
>>
>>
>> I was also wondering why the mimetype is not returned in the metadata
>> when parsing a file
>>
>>
>>
>> Thank you
>>
>> Peter
>>
>
