And yet, if I understand correctly, using a TikaInputStream *will* spool the entire stream to disk so it can read everything, right? If I then re-read the stream to parse it, is it making two passes?
In my use case, we will not have any filename or metadata. It will just be a stream. But you're right that we will want to parse it. So it sounds like the best approach is to run the detect on the first few bytes, which will at least give an idea of what it is, though not a precise one. (Should this be a TikaInputStream?) And then do the parse.

I'm still surprised, however, that the mimetype doesn't seem to appear on the Metadata after parsing.

-----Original Message-----
From: Nick Burch <[email protected]>
Sent: Wednesday, December 23, 2020 4:52 AM
To: [email protected]
Subject: RE: Mimetypes

On Tue, 22 Dec 2020, Peter Kronenberg wrote:
> Oh, so reading the stream doesn't read the whole file?

Not for Detect, no. The assumption is that Detect is normally followed by Parse, so you won't want the stream consumed, so we do a mark/reset to check the first few kb only.

> I know for Office files you can tell it's an Office file from the
> first dozen or so bytes, but you have to read the 2nd 512 block to
> find out more.

Not always... Many tools opt to put the properties blocks very close to the start, which lets you tell the type (because you can see the entry names), but not all do. For the rest, you need to open the OLE2 structure and check the names of the entries.

Nick
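
For what it's worth, here is a minimal sketch of the mark/reset peek described above for Detect, written against plain java.io rather than Tika itself. The class name SniffDemo, the sniff helper, the 8-byte peek limit and the labels it returns are all made up for illustration; Tika's own detectors look at more bytes and know far more signatures. The point is only that the stream is rewound after the peek, so a later parse still sees every byte.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class SniffDemo {
    // Number of leading bytes to inspect; an arbitrary choice for this sketch.
    static final int PEEK_LIMIT = 8;

    // Well-known magic numbers for three container formats.
    static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    static final byte[] PDF = {'%', 'P', 'D', 'F'};
    static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    // Hypothetical helper: guess a coarse type from the first bytes, then rewind.
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK_LIMIT);                         // remember the current position
        byte[] head = new byte[PEEK_LIMIT];
        int n = in.readNBytes(head, 0, PEEK_LIMIT);  // Java 9+
        in.reset();                                  // rewind so nothing is consumed

        if (startsWith(head, n, OLE2)) return "ole2"; // container only, not the Office subtype
        if (startsWith(head, n, PDF)) return "pdf";
        if (startsWith(head, n, ZIP)) return "zip";
        return "unknown";
    }

    // True if the first `len` valid bytes of `data` begin with `magic`.
    static boolean startsWith(byte[] data, int len, byte[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Fake 512-byte "file" that starts with the OLE2 signature.
        byte[] data = new byte[512];
        System.arraycopy(OLE2, 0, data, 0, OLE2.length);

        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(sniff(in));                // coarse guess from the header
        System.out.println(in.readAllBytes().length); // full stream is still readable
    }
}
```

Note that the OLE2 signature only says "this is an OLE2 container"; telling Word from Excel still means looking at the entry names inside, as discussed in the quoted reply.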
