Not quite. The filename gives extra information: for example, a jar file is (almost) a zip file, so if you look at the content only, you can't be sure which of the two it is. The moment you give an extra "hint" in the form of the name (.jar), the detector has more information to work from and can deduce a better result.
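To make that concrete, here is a toy sketch in plain Java. It is not Tika's implementation and it does not use the Tika API. The class name NameHintDetector and its detect method are made up for illustration, and only the first few magic bytes are checked. It shows the idea only: the bytes identify the container (zip or OLE2), and the optional name narrows it down to the specific type.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration only: NOT Tika's code, just the "bytes + name hint" idea.
class NameHintDetector {
    // Leading bytes of a zip local file header ("PK\3\4").
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // Leading bytes of an OLE2 compound document (.doc, .xls, .ppt, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // head: first bytes of the content; name: optional file name hint (may be null).
    static String detect(byte[] head, String name) {
        String lower = (name == null) ? "" : name.toLowerCase(Locale.ROOT);
        if (startsWith(head, ZIP_MAGIC)) {
            // The bytes say "zip container"; only the name can say "jar" here.
            return lower.endsWith(".jar") ? "application/java-archive" : "application/zip";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            // The bytes say "OLE2 container"; the suffix picks the specific format.
            return lower.endsWith(".doc") ? "application/msword" : "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(detect(OLE2_MAGIC, null));         // stream only, no name
        System.out.println(detect(OLE2_MAGIC, "report.doc")); // same bytes plus a name hint
    }
}
```

With the same OLE2 bytes, the call without a name can only report the generic container type, while the call with "report.doc" reports application/msword. That mirrors the difference you saw between detect(InputStream) and detect(File).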

On Tue, Dec 22, 2020, 22:21 Peter Kronenberg <[email protected]> wrote:

> I have tika-parsers in my pom (I think the doc said that tika-core is a dependency).
>
> I’ll play around with TikaInputStream as well as sending in the filename. Although I’m still not quite sure I understand why the results would be different. If you are looking at just the contents of the file and not relying on the file extension, shouldn’t the result be the same?
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Tuesday, December 22, 2020 4:16 PM
> *To:* [email protected]
> *Subject:* Re: Mimetypes
>
> Sorry, distracted.... You're calling Tika programmatically.
>
> The mime types _should_ show up in the metadata during the parse. Let me confirm that, though.
>
> The MSOffice OLE2 file types are fully/precisely detected by Apache POI if you have tika-parsers on your classpath. If you only have tika-core, then Tika can't tell the diff between OLE2 file formats and gives you application/ms-office-file.
>
> Again, if you have tika-core and you submit a file, Tika will detect OLE2 from the bytes and then rely on the file name suffix to id the specific ole2 file type, e.g. .doc. If you want to use an inputstream, you can send in the file name via the metadata, and you'll get the same result as if you use file.
>
> I _highly_ recommend using TikaInputStream.get(File f, Metadata metadata) if you can. This has some efficiency benefits, and TikaInputStream will set the file name so you'll get the precise file type.
>
> On Tue, Dec 22, 2020 at 4:07 PM Tim Allison <[email protected]> wrote:
>
> Hi Peter, Are you using tika-app, tika-server or something programmatic?
>
> On Tue, Dec 22, 2020 at 2:21 PM Peter Kronenberg <[email protected]> wrote:
>
> Hi, I just started playing with Tika and I have a few questions
>
> I’m trying to detect the mimetype of a file using both
>
> Tika.detect(InputStream)
> and
> Tika.detect(File)
>
> I get 2 different results. I’m testing with a Microsoft Word (.doc) file.
>
> As a stream, I get application/x-tika-msoffice. As a file I get application/msword
>
> Why are they different?
>
> I was also wondering why the mimetype is not returned in the metadata when parsing a file
>
> Thank you
> Peter
