But if I understand correctly, using a TikaInputStream *will* spool the 
entire stream to disk so it can read everything, right? If I then re-read the 
stream to parse it, is it making two passes?

In my use case, we will not have any filename or metadata. It will just be a 
stream. But you're right that we will want to parse it. So it sounds like 
the best way is to run the detect on the first few bytes, which will at 
least give a rough idea of what it is, though not a precise one. (Should this 
be a TikaInputStream?) And then do the parse.
I'm still surprised, however, that the MIME type doesn't seem to appear in the 
Metadata after parsing.

-----Original Message-----
From: Nick Burch <[email protected]> 
Sent: Wednesday, December 23, 2020 4:52 AM
To: [email protected]
Subject: RE: Mimetypes

On Tue, 22 Dec 2020, Peter Kronenberg wrote:
> Oh, so reading the stream doesn't read the whole file?

Not for Detect, no. The assumption is that Detect is normally followed by 
Parse, so you won't want the stream consumed. Instead, we do a mark/reset to 
check the first few KB only.
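
As a rough illustration of the mark/reset idea described above, here is a minimal sketch that uses only the JDK, not Tika itself. The class name PeekDemo and the helper peek are hypothetical, made up for this example; Tika's own detection code may well differ in its details.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: shows peeking at the start of a
// stream with mark/reset so that a later parse still sees every byte.
class PeekDemo {

    // Read up to `limit` bytes from the current position, then rewind.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                          // remember the current position
        try {
            byte[] buf = new byte[limit];
            int n = in.readNBytes(buf, 0, limit); // may be short at end of stream
            return Arrays.copyOf(buf, n);
        } finally {
            in.reset();                          // rewind to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support to streams that lack it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));

        byte[] head = peek(in, 5);
        System.out.println(new String(head, StandardCharsets.US_ASCII));

        // The stream is back at the start, so a full read sees all bytes.
        System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII));
    }
}
```

The point of the sketch is only that the peek does not consume the stream: whatever reads it next starts from the first byte again.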

> I know for Office files you can tell it's an Office file from the 
> first dozen or so bytes, but you have to read the 2nd 512 block to 
> find out more.

Not always... Many tools opt to put the properties blocks very close to the 
start, which lets you tell the type (because you can see the entry names), but 
not all do. For the rest, you need to open the OLE2 structure and check the 
names of the entries.
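
To make the "first dozen or so bytes" part concrete, here is a small sketch, again JDK only, that checks for the 8-byte OLE2 compound file signature. The class name Ole2Sniffer and the method looksLikeOle2 are hypothetical names for this example. A match only says the data is some OLE2 container; it cannot tell Word from Excel, which is why the entry names still have to be checked.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: tests whether the leading bytes
// carry the OLE2 compound file signature.
class Ole2Sniffer {

    // The 8-byte signature at the start of every OLE2 compound file.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if `head` starts with the OLE2 signature. This identifies the
    // container format only, not the specific document type inside it.
    static boolean looksLikeOle2(byte[] head) {
        if (head.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04}; // start of a ZIP file
        System.out.println(looksLikeOle2(OLE2_MAGIC));
        System.out.println(looksLikeOle2(zipHead));
    }
}
```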

Nick
