Ahh, I see

From: Tamás Cservenák <[email protected]>
Sent: Tuesday, December 22, 2020 4:29 PM
To: Tika Users <[email protected]>
Subject: Re: Mimetypes

Not quite, the filename gives extra info: for example, a jar file is (almost) a zip file. So if you look at the content only, it is not certain you can tell them apart. The moment you give an extra "hint" in the form of a name (jar), it can work from more information and deduce better results.

On Tue, Dec 22, 2020, 22:21 Peter Kronenberg <[email protected]> wrote:

I have tika-parsers in my pom (I think the doc said that tika-core is a dependency). I’ll play around with TikaInputStream as well as sending in the filename. Although I’m still not quite sure I understand why the results would be different. If you are looking at just the contents of the file and not relying on the file extension, shouldn’t the result be the same?

From: Tim Allison <[email protected]>
Sent: Tuesday, December 22, 2020 4:16 PM
To: [email protected]
Subject: Re: Mimetypes

Sorry, distracted.... You're calling Tika programmatically. The mime types _should_ show up in the metadata during the parse. Let me confirm that, though.

The MSOffice OLE2 file types are fully/precisely detected by Apache POI if you have tika-parsers on your classpath. If you only have tika-core, then Tika can't tell the difference between OLE2 file formats and gives you application/ms-office-file.

Again, if you have tika-core and you submit a file, Tika will detect OLE2 from the bytes and then rely on the file name suffix to identify the specific OLE2 file type, e.g. .doc. If you want to use an InputStream, you can send in the file name via the metadata, and you'll get the same result as if you use a file.

I _highly_ recommend using TikaInputStream.get(File f, Metadata metadata) if you can. This has some efficiency benefits, and TikaInputStream will set the file name so you'll get the precise file type.

On Tue, Dec 22, 2020 at 4:07 PM Tim Allison <[email protected]> wrote:

Hi Peter,

Are you using tika-app, tika-server or something programmatic?

On Tue, Dec 22, 2020 at 2:21 PM Peter Kronenberg <[email protected]> wrote:

Hi,

I just started playing with Tika and I have a few questions.

I’m trying to detect the mimetype of a file using both Tika.detect(InputStream) and Tika.detect(File). I get 2 different results. I’m testing with a Microsoft Word (.doc) file. As a stream, I get application/x-tika-msoffice. As a file, I get application/msword. Why are they different?

I was also wondering why the mimetype is not returned in the metadata when parsing a file.

Thank you
Peter
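
To make the point from the replies concrete (the bytes identify the container, the name narrows it down), here is a minimal Java sketch. It is not Tika's detector and does not use the Tika API: the class NameHintSketch and its methods are hypothetical, and the extension table is a small assumed subset. Only the magic numbers and the MIME type strings are real.

```java
import java.util.Locale;

// Simplified illustration only. This is NOT how Tika is implemented;
// all names in this class are hypothetical.
class NameHintSketch {

    // Every OLE2 (Compound File) container starts with these 8 bytes,
    // whether it holds a Word, Excel or PowerPoint document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // A ZIP local file header starts with "PK" 0x03 0x04; a jar is a zip too.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the lower-cased extension, or "" when there is no usable name.
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Step 1: identify the container from the leading bytes.
    // Step 2: use the file name, if any, as a hint to pick the specific type.
    static String detect(byte[] prefix, String fileName) {
        String ext = extensionOf(fileName);
        if (startsWith(prefix, OLE2_MAGIC)) {
            switch (ext) {
                case "doc": return "application/msword";
                case "xls": return "application/vnd.ms-excel";
                case "ppt": return "application/vnd.ms-powerpoint";
                default:    return "application/x-tika-msoffice"; // generic OLE2
            }
        }
        if (startsWith(prefix, ZIP_MAGIC)) {
            return ext.equals("jar") ? "application/java-archive" : "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] ole2 = OLE2_MAGIC.clone();
        System.out.println(detect(ole2, null));         // application/x-tika-msoffice
        System.out.println(detect(ole2, "report.doc")); // application/msword
    }
}
```

The same bytes give the generic answer without a name and the specific one with it, which matches the stream versus file difference in the original question. With Tika itself, the corresponding step is the one Tim describes: send the file name in via the metadata, or use TikaInputStream so the name is set for you.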
