>> In my use case, we will not have any filename or metadata.  It will
>> just be a stream.  But you're right in that we will want to parse it.
>> So it sounds like the best way to do it is to do the detect on the
>> first few bytes, which will at least give you an idea of what it is,
>> but not precise. (Should this be a TikaStream?)  And then do the parse.



> Best is to wrap as a TikaInputStream, detect using all the detectors via
> DefaultDetector, then parse after that.



But sometimes detection will read the whole file, right?  For example, for
Word documents.  So is it then making two passes over the data?



>> I'm still surprised, however, that the mimetype doesn't seem to appear
>> on the Metadata after parsing.



> IIRC it does if you use AutoDetectParser but not always otherwise, but I'm
> not certain on that...



Oh, ok, you’re right.  It’s listed as Content-Type.  I was searching for 
Mime-type. 😊





-----Original Message-----
From: Nick Burch <[email protected]>
Sent: Wednesday, December 23, 2020 9:28 AM
To: [email protected]
Subject: RE: Mimetypes



On Wed, 23 Dec 2020, Peter Kronenberg wrote:

> But yet, if I understand correctly, using a TikaInputStream *will*
> spool the entire stream to disk so it can read everything, right?  If
> I re-read the stream to parse, is it making 2 passes?



TikaInputStream has logic in it to dump the stream to a temp file so it can be
re-read multiple times as required. It only does that dump if required, though;
for formats that don't need it, it just acts as a buffering / mark + reset
stream.
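
To make the "buffering / mark + reset" behaviour concrete, here is a minimal
sketch using only the JDK (Java 11+), not Tika itself. The class PeekThenRead
and its helpers peek and sniff are hypothetical names made up for illustration,
and the two magic-byte checks are a tiny stand-in for real detection. It shows
the general pattern of reading a few leading bytes, rewinding, and then handing
the same stream on for a full read; it is not a description of how
TikaInputStream is implemented internally.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PeekThenRead {

    // Hypothetical helper: read up to `limit` leading bytes, then rewind so
    // the next reader sees the stream from the beginning again.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                  // remember the current position
        try {
            return in.readNBytes(limit); // may return fewer bytes at end of stream
        } finally {
            in.reset();                  // rewind to the marked position
        }
    }

    // Hypothetical, deliberately tiny magic-byte check. Real detection covers
    // far more formats and may need more than a short prefix.
    static String sniff(byte[] head) {
        byte[] pdf = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (startsWith(head, pdf)) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            // ZIP signature: also matches ZIP-based containers such as .docx,
            // so the prefix alone cannot say which one it is.
            return "application/zip";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 example body".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support on top of any stream.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            String type = sniff(peek(in, 8)); // first pass: a few bytes only
            byte[] all = in.readAllBytes();   // second pass: starts from byte 0
            System.out.println(type + ", " + all.length + " bytes re-read");
        }
    }
}
```

The ZIP case also shows why detection on the first few bytes is "not precise":
a prefix check can only report the container, and telling a .docx from a plain
.zip needs a look further into the file.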



> In my use case, we will not have any filename or metadata.  It will
> just be a stream.  But you're right in that we will want to parse it.
> So it sounds like the best way to do it is to do the detect on the
> first few bytes, which will at least give you an idea of what it is,
> but not precise. (Should this be a TikaStream?)  And then do the parse.



Best is to wrap as a TikaInputStream, detect using all the detectors via 
DefaultDetector, then parse after that.



> I'm still surprised, however, that the mimetype doesn't seem to appear
> on the Metadata after parsing.



IIRC it does if you use AutoDetectParser but not always otherwise, but I'm not 
certain on that...



Nick
