Not quite: the file name gives extra information. For example, a jar file is
(almost) a zip file, so if you look at the content only, you cannot be sure
which of the two it is. Once you give an extra "hint" in the form of the name
(.jar), the detector can work from more information and deduce a better result.
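
To make the jar/zip point concrete, here is a toy sketch in plain Java. It is
not Tika's actual detection code; the class and method names are made up for
illustration, and it only knows the ZIP magic bytes and the .jar suffix.

```java
import java.util.Locale;

// Hypothetical illustration only: not how Tika is implemented.
final class NameHintSketch {

    // Local file header signature that starts most ZIP-based files: "PK" 0x03 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Content-only guess: the bytes can tell us "some kind of zip", nothing more. */
    static String detectFromBytes(byte[] head) {
        if (head.length < ZIP_MAGIC.length) {
            return "application/octet-stream";
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (head[i] != ZIP_MAGIC[i]) {
                return "application/octet-stream";
            }
        }
        return "application/zip";
    }

    /** Content guess, refined by an optional file name hint (may be null). */
    static String detect(byte[] head, String fileName) {
        String type = detectFromBytes(head);
        if (fileName == null || !type.equals("application/zip")) {
            return type; // no hint, or the bytes are not a zip container at all
        }
        if (fileName.toLowerCase(Locale.ROOT).endsWith(".jar")) {
            return "application/java-archive";
        }
        return type;
    }

    public static void main(String[] args) {
        byte[] head = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detect(head, null));      // application/zip
        System.out.println(detect(head, "app.jar")); // application/java-archive
    }
}
```

The same bytes give a generic answer without a name and a specific one with
it, which is the difference you are seeing between the stream and the file.
In Tika itself, the facade method Tika.detect(InputStream, String) lets you
pass the name along with a stream.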

On Tue, Dec 22, 2020, 22:21 Peter Kronenberg <[email protected]>
wrote:

> I have tika-parsers in my pom (I think the doc said that tika-core is a
> dependency).
>
>
>
> I’ll play around with TikaInputStream as well as sending in the filename.
> Although I’m still not quite sure I understand why the results would be
> different.  If you are looking at just the contents of the file and not
> relying on the file extension, shouldn’t the result be the same?
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Tuesday, December 22, 2020 4:16 PM
> *To:* [email protected]
> *Subject:* Re: Mimetypes
>
>
>
> Sorry, distracted....  You're calling Tika programmatically.
>
>
>
> The mime types _should_ show up in the metadata during the parse.  Let me
> confirm that, though.
>
>
>
> The MSOffice OLE2 file types are fully/precisely detected by Apache POI if
> you have tika-parsers on your classpath.  If you only have tika-core, then
> Tika can't tell the difference between OLE2 file formats and gives you
> application/x-tika-msoffice.
>
>
>
> Again, if you have tika-core and you submit a file, Tika will detect OLE2
> from the bytes and then rely on the file name suffix to id the specific
> ole2 file type, e.g. .doc.  If you want to use an inputstream, you can send
> in the file name via the metadata, and you'll get the same result as if you
> use file.
>
>
>
> I _highly_ recommend using TikaInputStream.get(File f, Metadata metadata)
> if you can.  This has some efficiency benefits, and TikaInputStream will
> set the file name so you'll get the precise file type.
>
>
>
> On Tue, Dec 22, 2020 at 4:07 PM Tim Allison <[email protected]> wrote:
>
> Hi Peter, Are you using tika-app, tika-server or something programmatic?
>
>
>
> On Tue, Dec 22, 2020 at 2:21 PM Peter Kronenberg <
> [email protected]> wrote:
>
> Hi, I just started playing with Tika and I have a few questions
>
>
>
> I’m trying to detect the mimetype of a file using both
>
>
>
> Tika.detect(InputStream)
>
> and
>
> Tika.detect(File)
>
>
>
> I get 2 different results.  I’m testing with a Microsoft Word (.doc) file.
>
>
>
> As a stream, I get application/x-tika-msoffice.  As a file I get
> application/msword
>
>
>
> Why are they different?
>
>
>
> I was also wondering why the mimetype is not returned in the metadata when
> parsing a file
>
>
>
> Thank you
>
> Peter
>
>
