RE: Mimetypes

2020-12-23 Thread Nick Burch
On Wed, 23 Dec 2020, Peter Kronenberg wrote: Best is to wrap as a TikaInputStream, detect using all the detectors via DefaultDetector, then parse after that. But sometimes the detect will read the whole file, right? For example, for Word. So is it then making 2 passes? Nope, we stash the

RE: Mimetypes

2020-12-23 Thread Peter Kronenberg
>> In my use case, we will not have any filename or metadata. It will just be a stream. But you're right in that we will want to parse it. So it sounds like the best way to do it is to do the detect on the first few bytes, which will at least give you an idea of what it is, but

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Wed, 23 Dec 2020, Peter Kronenberg wrote: But yet, if I understand correctly, using a TikaInputStream *will* spool the entire stream to disk so it can read everything, right? If I re-read the stream to parse, is it making 2 passes? TikaInputStream has logic in it to dump the stream to a temp

RE: Mimetypes

2020-12-23 Thread Peter Kronenberg
But yet, if I understand correctly, using a TikaInputStream *will* spool the entire stream to disk so it can read everything, right? If I re-read the stream to parse, is it making 2 passes? In my use case, we will not have any filename or metadata. It will just be a stream. But you're

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Tue, 22 Dec 2020, Peter Kronenberg wrote: Oh, so reading the stream doesn't read the whole file? Not for Detect, no. The assumption is that Detect is normally followed by Parse, so you won't want the stream consumed; we do a mark/reset to check the first few KB only. I know for
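
For anyone following the thread, the mark/reset idea described above can be sketched with plain java.io, without Tika on the classpath. This is only an illustration of the general pattern, not Tika's actual detector code: the class name MarkResetDetectSketch, the PEEK_LIMIT value, and the two magic-byte checks are all made up for the example. It marks the stream, reads a few leading bytes, guesses a type, then resets so that a later parse step still sees the stream from the beginning.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetDetectSketch {

    // Hypothetical read-ahead limit for this sketch; not a value taken from Tika.
    static final int PEEK_LIMIT = 8 * 1024;

    /** Peeks at the head of the stream and rewinds, so no bytes are consumed. */
    public static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK_LIMIT);
        try {
            byte[] head = new byte[4];
            int n = in.readNBytes(head, 0, head.length);
            // Toy magic-byte checks, only to show the shape of a detector.
            if (startsWith(head, n, "%PDF")) {
                return "application/pdf";
            }
            if (startsWith(head, n, "PK")) {
                return "application/zip";
            }
            return "application/octet-stream";
        } finally {
            // Rewind to the marked position so the caller can read from the start.
            in.reset();
        }
    }

    private static boolean startsWith(byte[] head, int length, String magic) {
        byte[] expected = magic.getBytes(StandardCharsets.US_ASCII);
        if (length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (head[i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 example body".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream supplies mark/reset support for the sketch.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        String type = detect(in);
        // The "parse" step: the whole stream is still available after detect.
        byte[] all = in.readAllBytes();
        System.out.println(type + ", bytes still readable: " + all.length);
    }
}
```

The point of the sketch is that detection and parsing share one stream and one pass over the underlying source: the detector only looks at a bounded prefix, and the reset hands the untouched stream on to whatever reads it next.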