Detectors should return the stream reset to the beginning. Parsers, IIRC, should return the stream fully(?) read but not closed.
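
Since the end state of the stream can differ from parser to parser, one parser-independent option is to avoid relying on it at all: spool the incoming stream to a temp file once and open a fresh stream for every pass. What follows is only a minimal JDK-only sketch of that idea, not Tika code. The names SpooledSource, newStream and consume are hypothetical, and consume merely stands in for a call such as parser.parse(...), which in real use would receive the first stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Sketch only: spool an arbitrary stream to a temp file once, then hand out
 * a fresh stream per pass. All names here are hypothetical, not Tika API.
 */
public class SpooledSource implements AutoCloseable {

    private final Path tempFile;

    /** Copies the whole incoming stream to a temp file (the one extra pass). */
    public SpooledSource(InputStream in) throws IOException {
        tempFile = Files.createTempFile("spooled-", ".bin");
        Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Each call returns an independent stream positioned at byte 0. */
    public InputStream newStream() throws IOException {
        return Files.newInputStream(tempFile);
    }

    /** Deletes the temp file; streams must not be opened after this. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }

    /** Stand-in for a parser: reads up to limit bytes and returns the count. */
    public static long consume(InputStream in, long limit) throws IOException {
        long n = 0;
        while (n < limit && in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, tika".getBytes(StandardCharsets.UTF_8);
        try (SpooledSource source = new SpooledSource(new ByteArrayInputStream(data))) {
            // First pass: pretend a parser stopped partway through the stream.
            try (InputStream first = source.newStream()) {
                consume(first, 5);
            }
            // Second pass: starts at the beginning no matter where the first ended.
            try (InputStream second = source.newStream()) {
                System.out.println(new String(second.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

The trade-off is one extra pass over the data plus temp-file I/O, in exchange for not caring whether a given parser leaves the stream consumed or reset.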
On Mon, Mar 1, 2021 at 10:29 AM Tim Allison <[email protected]> wrote:

> Reusing streams after parsing hasn't been something I've done before...
>
> This is not expected behavior. Parsers should all behave the same.
>
> On Mon, Mar 1, 2021 at 10:24 AM Peter Kronenberg <[email protected]> wrote:
>
>> After more testing, it seems that it has nothing to do with
>> TikaInputStream. I just passed in a BufferedInputStream to the parsers. I
>> see that the first thing the AutoDetectParser does is to convert it to a
>> TikaInputStream. So maybe TIS is being leveraged at a lower level, but
>> there is no reason for me to use the TIS at my level.
>>
>> But the issue is that different parsers return the stream in different
>> states. Sometimes the stream is all used up (although not closed). And
>> other times, the stream has been reset to the beginning where it can be
>> re-used. Is this expected behavior?
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 26, 2021 10:03 PM
>> *To:* [email protected]
>> *Cc:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>> But as I said, this doesn’t seem to work with all parsers. So let’s
>> say I pass in an MP4 file which uses the MP4Parser and then I want to
>> re-use the stream afterward. How can I guarantee consistent behavior, no
>> matter which parser gets used?
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Friday, February 26, 2021 3:17 PM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]; [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>> The stream.available() call comes from ProxyInputStream.
>> We don't modify that in TikaInputStream...maybe we should.
>>
>> TikaInputStream wraps an incoming InputStream in a BufferedInputStream if
>> it doesn't support mark already.
>>
>> So, as long as you're happy with the performance and potential
>> limitations of BufferedInputStream, go with TikaInputStream.
>>
>> Note that some parsers have to spool to disk. TikaInputStream takes care
>> of this for you.
>>
>> On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <[email protected]> wrote:
>>
>> I think I figured this out. It seems to depend on what parser is used.
>> Not sure if this just has to do with inconsistent implementations, or there
>> is some reason behind it.
>>
>> For most audio files, using the AudioParser, the buffer is still at the
>> beginning. Even though there is no text extraction, I would think that
>> Tika still needs to read through the stream.
>>
>> The MP3Parser consumes the stream, but the MP4Parser does not.
>>
>> The OCR parser also leaves the pointer at the beginning. It definitely
>> consumes the stream, so it must be resetting it.
>>
>> So what is going on? And now I get back to my original question, which
>> is, what is the best way to consistently be able to re-use the stream?
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 26, 2021 12:18 PM
>> *To:* [email protected]; [email protected]
>> *Cc:* [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> So is this guaranteed, expected behavior?
>>
>> With a BufferedInputStream – I expect this
>>
>> try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
>>     System.out.printf("before - bytes available: %s%n", stream.available());
>>     parser.parse(stream, handler, metadata, parseContext);
>>     System.out.printf("after - bytes available: %s%n", stream.available());
>> }
>>
>> before - bytes available: 10546620
>>
>> after - bytes available: 0
>>
>> But with a TikaInputStream, I get this
>>
>> Note that I’m purposely creating a FileInputStream first in order to hide
>> the file information from the TikaInputStream, since in my normal use case,
>> I’m dealing with a regular InputStream, not reading from a file
>>
>> try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
>>     System.out.printf("before - bytes available: %s, position: %s%n",
>>             stream.available(), stream.getPosition());
>>     parser.parse(stream, handler, metadata, parseContext);
>>     System.out.printf("after - bytes available: %s, position: %s%n",
>>             stream.available(), stream.getPosition());
>> }
>>
>> before - bytes available: 10546620, position: 0
>>
>> after - bytes available: 10546620, position: 0
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Thursday, February 25, 2021 11:28 AM
>> *To:* [email protected]; [email protected]
>> *Cc:* [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> Or reading from the cloud, either Google or AWS, in which case I also get
>> a stream.
>> I know what the file name is, but can’t really use it.
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Thursday, February 25, 2021 11:19 AM
>> *To:* [email protected]
>> *Cc:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> With a stream. I am reading arbitrary streams and one of the goals is to
>> figure out what it is. So there is no file backing it.
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Thursday, February 25, 2021 11:11 AM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]; [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>> Are you initializing with a file or a stream?
>>
>> On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <[email protected]> wrote:
>>
>> But how is TikaInputStream allowing me to re-use the stream without me
>> doing anything special? Is it automatically spooling to disk as needed?
>>
>> I wouldn’t say that I can’t afford to spool to disk. I’m just looking
>> for the most reasonable solution. I don’t know how big the streams are
>> that I’ll be processing. Obviously, if they’re big, then keeping them in
>> memory is not reasonable and disk is the only option. But for smaller
>> streams, if it can do it all in memory, that’s obviously better. And for
>> my use case, I don’t **always** have to re-read the stream.
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Thursday, February 25, 2021 5:48 AM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>> My $0.02 would be to use TikaInputStream because that gets a lot more use
>> and is battle-tested.
>> Within the last year or so, we started using
>> RereadableInputStream in one of the Microsoft format parsers so it is also
>> getting some use now.
>>
>> If you absolutely can't afford to spool to disk, then give
>> RereadableInputStream a try.
>>
>> The InputStreamFactory classes, in my mind, are somewhat work-arounds for
>> other use cases, e.g. retrying/batch etc.
>>
>> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <[email protected]> wrote:
>>
>> So this might be moot, because it seems that TikaInputStream is already
>> doing some magic and I’m not sure how.
>>
>> I was able to re-use the stream without doing anything special after a
>> call to parse. And in fact, I displayed stream.available() and
>> stream.getPosition() before and after the call to parse, and the full
>> stream was still available at position 0. What is TikaInputStream doing to
>> make this happen?
>>
>> Just for some additional context, what I’m doing is running the file
>> through Tika and then, depending on the file type, I want to do some
>> additional non-Tika processing. I thought that once the Tika parse was
>> done, the stream would be used up.
>>
>> What is going on?
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Tuesday, February 23, 2021 10:00 AM
>> *To:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> I just found the RereadableInputStream. This looks more like what I was
>> thinking. Is there any reason not to use it? What are the Tika best
>> practices? Pros/Cons of each approach?
>> If RereadableInputStream works as
>> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory.
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Monday, February 22, 2021 8:30 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> Oh ok. I didn’t realize I needed to write my own class to implement it.
>> I was looking for some sort of existing framework.
>>
>> What is the purpose of the 2 InputStreamFactory classes?
>>
>> I was re-reading some emails with Nick Burch back around Dec 22-23 and
>> maybe I misunderstood him, but it sounds like he was saying that
>> TikaInputStream was smart enough to automatically spool the stream to disk
>> to allow re-use.
>>
>> It seems to me that I need an extra pass through the data in order to
>> save to disk. I’m not starting from a File, but from a stream. So if I
>> need to read the stream twice, I really have to pass through the data 3
>> times, correct?
>>
>> Unless there is a way to save to disk during the first pass
>>
>> (try/catch removed for simplicity)
>>
>> tis = TikaInputStream.get(InputStream);
>> file = tis.getFile();   <-- extra pass
>> tis = TikaInputStream.get(new MyInputStreamFactory(file));
>> // first real pass
>> InputStream is = tis.getInputStreamFactory().getInputStream();
>> // second real pass
>>
>> *From:* Luís Filipe Nassif <[email protected]>
>> *Sent:* Monday, February 22, 2021 5:42 PM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>> Something like:
>>
>> class MyInputStreamFactory implements InputStreamFactory {
>>
>>     private File file;
>>
>>     public MyInputStreamFactory(File file) {
>>         this.file = file;
>>     }
>>
>>     // FileInputStream's constructor throws a checked exception
>>     public InputStream getInputStream() throws IOException {
>>         return new FileInputStream(file);
>>     }
>> }
>>
>> in your client code:
>>
>> Parser parser = new AutoDetectParser();
>> TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
>> parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());
>>
>> when you need to reuse the stream (into your parser):
>>
>> public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
>>         ParseContext context) throws IOException, SAXException, TikaException {
>>     //(...)
>>     TikaInputStream tis = TikaInputStream.get(stream);
>>     if (tis.hasInputStreamFactory()) {
>>         try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
>>             // consume the new stream
>>         }
>>     } else {
>>         throw new IOException("not a reusable inputStream");
>>     }
>> }
>>
>> Of course this is useful if you are not processing files, e.g. reading
>> files from the cloud or sockets.
>>
>> Regards,
>> Luis
>>
>> On Mon, Feb 22, 2021 at 19:18, Peter Kronenberg <[email protected]> wrote:
>>
>> I sent this question late on Friday. Sending it again. Can you provide
>> a little more information on how to use the InputStreamFactory?
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 19, 2021 5:10 PM
>> *To:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> There appear to be 2 InputStreamFactory classes: in tika-server-core and
>> tika-io. The one in tika-server-core is the only one with a concrete class.
>>
>> I’m not quite sure I see how to use this.
>>
>> Normally, I create a TikaInputStream with
>> TikaInputStream.get(InputStream). How do I create it from an
>> InputStreamFactory?
>>
>> TikaInputStream.getInputStreamFactory() only returns a factory if the
>> TikaInputStream was created from a factory.
>>
>> Is there a good example of how this is used?
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 19, 2021 4:57 PM
>> *To:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>> Thanks. I thought that TikaInputStream already automatically saved to
>> disk to allow re-reading.
>>
>> *From:* Luís Filipe Nassif <[email protected]>
>> *Sent:* Friday, February 19, 2021 3:44 PM
>> *To:* [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>> You could call TikaInputStream.getPath() at the beginning of your parser;
>> it will spool to a file if the stream is not file based. After consuming
>> the original inputStream, create a new one from the temp file created.
>>
>> If you are using 2.0.0-ALPHA, there is:
>>
>> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>>
>> Use with the new methods from TikaInputStream:
>>
>> public static TikaInputStream get(InputStreamFactory factory)
>> public InputStreamFactory getInputStreamFactory()
>>
>> Hope this helps,
>> Luis
>>
>> On Fri, Feb 19, 2021 at 16:09, Peter Kronenberg <[email protected]> wrote:
>>
>> If I finish parsing a TikaStream, can I re-use the stream (before it is
>> closed)? I know you said that there is some magic behind the scenes where
>> it spools it to a file. Can I just call reset() to start from the
>> beginning?
>>
>> Peter
>>
>> *Peter Kronenberg* *|* *Senior AI Analytic ENGINEER*
>> *C: 703.887.5623*
>> 4303 W. 119th St., Leawood, KS 66209
>> WWW.TORCH.AI <http://www.torch.ai/>
