>> In my use case, we will not have any filename or metadata. It will
>> just be a stream. But you're right in that we will want to parse it.
>> So it sounds like the best way to do it is to do the detect on the
>> first few bytes, which will at least give you an idea of what it is,
>> but not precise. (Should this be a TikaStream?) And then do the parse.

> Best is to wrap as a TikaInputStream, detect using all the detectors via
> DefaultDetector, then parse after that.

But sometimes the detect will read the whole file, right? For example, for Word. So is it then making 2 passes?

>> I'm still surprised, however, that the mimetype doesn't seem to appear
>> on the Metadata after parsing.

> IIRC it does if you use AutoDetectParser but not always otherwise, but I'm
> not certain on that...

Oh, ok, you’re right. It’s listed as Content-Type. I was searching for Mime-type. 😊

-----Original Message-----
From: Nick Burch <[email protected]>
Sent: Wednesday, December 23, 2020 9:28 AM
To: [email protected]
Subject: RE: Mimetypes

On Wed, 23 Dec 2020, Peter Kronenberg wrote:
> But yet, if I understand correctly, using a TikaInputStream *will*
> spool the entire stream to disk so it can read everything, right? If
> I re-read the stream to parse, is it making 2 passes?

TikaInputStream has logic in it to dump the stream to a temp file so it can
be re-read multiple times as required. It only does that dump if required
though; for formats that don't need it, it just acts as a buffering /
mark + reset stream.

> In my use case, we will not have any filename or metadata. It will
> just be a stream. But you're right in that we will want to parse it.
> So it sounds like the best way to do it is to do the detect on the
> first few bytes, which will at least give you an idea of what it is,
> but not precise. (Should this be a TikaStream?) And then do the parse.

Best is to wrap as a TikaInputStream, detect using all the detectors via
DefaultDetector, then parse after that.

> I'm still surprised, however, that the mimetype doesn't seem to appear
> on the Metadata after parsing.

IIRC it does if you use AutoDetectParser but not always otherwise, but I'm
not certain on that...

Nick
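
For readers following the "2 passes" question, here is a rough sketch of the buffering / mark + reset idea mentioned in the thread, written with plain java.io only. It is not Tika's implementation and it does not use the Tika API. The class name MarkResetSketch, the PEEK_LIMIT constant, and the tiny magic-byte table are all hypothetical, chosen for illustration. It shows how a detector can peek at the leading bytes and rewind, so that a later "parse" step (here just a byte count) still sees the stream from the start without re-opening it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustration only: a stand-in for the mark/reset behaviour described above,
// not Tika code. All names here are hypothetical.
final class MarkResetSketch {

    // Hypothetical peek size for this sketch; not a Tika constant.
    static final int PEEK_LIMIT = 8;

    /** Guesses a type from the leading bytes, then rewinds the stream. */
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream is shorter than PEEK_LIMIT
            }
            n += r;
        }
        in.reset(); // the caller can now read from the first byte again

        if (startsWith(head, n, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        if (startsWith(head, n, new byte[] {'P', 'K'})) {
            // Only a rough answer: a ZIP container could be many things.
            return "application/zip";
        }
        return "application/octet-stream";
    }

    /** Stand-in for parsing: consumes the stream and counts its bytes. */
    static long countRemaining(InputStream in) throws IOException {
        long total = 0;
        byte[] buf = new byte[4096];
        int r;
        while ((r = in.read(buf)) >= 0) {
            total += r;
        }
        return total;
    }

    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            String type = detect(in);        // peeks, then rewinds
            long seen = countRemaining(in);  // still sees every byte
            System.out.println(type + " " + seen + " of " + data.length + " bytes");
        }
    }
}
```

A peek like this only covers formats that can be identified from the first few bytes. For formats where detection needs more of the file, the thread says TikaInputStream dumps the stream to a temp file so it can be re-read, which this sketch does not attempt.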
