Sorry, I was distracted. You're calling Tika programmatically. The mime types _should_ show up in the metadata during the parse. Let me confirm that, though.
The MS Office OLE2 file types are detected precisely by Apache POI if you have tika-parsers on your classpath. If you only have tika-core, Tika can't tell the difference between the OLE2 file formats from the bytes alone, and it gives you the generic application/x-tika-msoffice.

With tika-core only, if you submit a file, Tika detects OLE2 from the bytes and then relies on the file name suffix (e.g. .doc) to identify the specific OLE2 file type. That is why Tika.detect(File) gives you application/msword while Tika.detect(InputStream) gives you the generic type. If you want to use an InputStream, you can send in the file name via the metadata, and you'll get the same result as with a File.

I _highly_ recommend using TikaInputStream.get(File f, Metadata metadata) if you can. This has some efficiency benefits, and TikaInputStream will set the file name so you'll get the precise file type.

On Tue, Dec 22, 2020 at 4:07 PM Tim Allison <[email protected]> wrote:

> Hi Peter, Are you using tika-app, tika-server or something programmatic?
>
> On Tue, Dec 22, 2020 at 2:21 PM Peter Kronenberg <[email protected]> wrote:
>
>> Hi, I just started playing with Tika and I have a few questions
>>
>> I’m trying to detect the mimetype of a file using both
>>
>> Tika.detect(InputStream)
>>
>> and
>>
>> Tika.detect(File)
>>
>> I get 2 different results. I’m testing with a Microsoft Word (.doc) file.
>>
>> As a stream, I get application/x-tika-msoffice. As a file I get application/msword
>>
>> Why are they different?
>>
>> I was also wondering why the mimetype is not returned in the metadata when parsing a file
>>
>> Thank you
>>
>> Peter
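
P.S. In case it helps to see the two-step idea from my reply (bytes first, then file name suffix) spelled out, here is a toy sketch in plain Java. It is not Tika's actual code: the class name Ole2DetectSketch, its methods, and the small suffix table are made up for illustration. Only the OLE2 signature bytes and the mime type strings are real.

```java
import java.util.Arrays;
import java.util.Locale;

// Toy illustration only: this is NOT Tika's implementation.
final class Ole2DetectSketch {

    // The 8-byte signature found at the start of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Step 1: inspect the leading bytes only.
    static boolean looksLikeOle2(byte[] header) {
        return header != null
                && header.length >= OLE2_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Step 2: if a file name is known, use its suffix to refine the generic type.
    static String detect(byte[] header, String fileName) {
        if (!looksLikeOle2(header)) {
            return "application/octet-stream";
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".doc")) return "application/msword";
            if (lower.endsWith(".xls")) return "application/vnd.ms-excel";
            if (lower.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
        }
        // Bytes say "OLE2", but there is no name hint to narrow it down.
        return "application/x-tika-msoffice";
    }

    public static void main(String[] args) {
        byte[] header = OLE2_MAGIC.clone();
        System.out.println(detect(header, null));         // stream-like case: no name
        System.out.println(detect(header, "report.doc")); // file-like case: name known
    }
}
```

The first call mirrors what you saw with a bare InputStream, and the second what you saw with a File. In Tika itself the name hint travels in the Metadata object, or TikaInputStream sets it for you when you hand it a File.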
