Ahh, I see

From: Tamás Cservenák <[email protected]>
Sent: Tuesday, December 22, 2020 4:29 PM
To: Tika Users <[email protected]>
Subject: Re: Mimetypes

Not quite, the filename gives extra info: for example, a jar file is (almost) a 
zip file. So if you look at the content only, it is not certain you can tell 
them apart. The moment you give an extra "hint" in the form of the name (jar), 
it can work from more information and deduce better results.
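
To make the jar/zip point concrete, here is a minimal sketch of content-only 
detection versus detection with a name hint. It is not Tika's implementation: 
the class ZipHintSketch and its methods are hypothetical, and it only checks 
the usual ZIP local-file-header signature.

```java
// Illustrative sketch only: this is NOT how Tika is implemented.
// ZipHintSketch and its method names are hypothetical.
class ZipHintSketch {
    // A typical ZIP archive starts with the local file header "PK\003\004".
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
    }

    // Content alone yields the generic container type; a name can narrow it.
    static String detect(byte[] head, String nameOrNull) {
        if (!looksLikeZip(head)) {
            return "application/octet-stream";
        }
        if (nameOrNull != null
                && nameOrNull.toLowerCase(java.util.Locale.ROOT).endsWith(".jar")) {
            return "application/java-archive";
        }
        return "application/zip";
    }

    public static void main(String[] args) {
        byte[] head = {'P', 'K', 3, 4};
        System.out.println(detect(head, null));      // application/zip
        System.out.println(detect(head, "app.jar")); // application/java-archive
    }
}
```

The same bytes give two answers depending on whether a name is supplied, which 
is the effect described above.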

On Tue, Dec 22, 2020, 22:21 Peter Kronenberg 
<[email protected]> wrote:
I have tika-parsers in my pom (I think the doc said that tika-core is a 
dependency).

I’ll play around with TikaInputStream as well as sending in the filename.  
Although I’m still not quite sure I understand why the results would be 
different.  If you are looking at just the contents of the file and not relying 
on the file extension, shouldn’t the result be the same?

From: Tim Allison <[email protected]>
Sent: Tuesday, December 22, 2020 4:16 PM
To: [email protected]
Subject: Re: Mimetypes

Sorry, distracted....  You're calling Tika programmatically.

The mime types _should_ show up in the metadata during the parse.  Let me 
confirm that, though.

The MSOffice OLE2 file types are fully/precisely detected by Apache POI if you 
have tika-parsers on your classpath.  If you only have tika-core, then Tika 
can't tell the difference between OLE2 file formats and gives you the generic 
application/x-tika-msoffice.

Again, if you have only tika-core and you submit a file, Tika will detect OLE2 
from the bytes and then rely on the file name suffix to identify the specific 
OLE2 file type, e.g. .doc.  If you want to use an InputStream, you can send in 
the file name via the metadata, and you'll get the same result as if you use a 
file.

I _highly_ recommend using TikaInputStream.get(File f, Metadata metadata) if 
you can.  This has some efficiency benefits, and TikaInputStream will set the 
file name so you'll get the precise file type.

On Tue, Dec 22, 2020 at 4:07 PM Tim Allison 
<[email protected]> wrote:
Hi Peter, Are you using tika-app, tika-server or something programmatic?

On Tue, Dec 22, 2020 at 2:21 PM Peter Kronenberg 
<[email protected]> wrote:
Hi, I just started playing with Tika and I have a few questions

I’m trying to detect the mimetype of a file using both

Tika.detect(InputStream)
and
Tika.detect(File)

I get 2 different results.  I’m testing with a Microsoft Word (.doc) file.

As a stream, I get application/x-tika-msoffice.  As a file I get 
application/msword

Why are they different?

I was also wondering why the mimetype is not returned in the metadata when 
parsing a file

Thank you
Peter
