Reusing streams after parsing hasn't been something I've done before... This is not expected behavior. Parsers should all behave the same.
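
If you need the same result no matter which parser runs, one option that does not rely on what any parser leaves behind is to spool the stream yourself once and hand each consumer a fresh stream. Below is a minimal sketch under stated assumptions: it uses only JDK classes, `SpooledSource` is a made-up name and not a Tika class, and the byte-counting loop merely stands in for `parser.parse(...)` and whatever non-Tika processing follows.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of Tika): spool once, then open a fresh stream per pass.
class SpooledSource implements AutoCloseable {
    private final Path tempFile;

    // Copies the whole incoming stream to a temp file; this is the only pass over the original.
    SpooledSource(InputStream in) throws IOException {
        tempFile = Files.createTempFile("spooled-", ".bin");
        // createTempFile already created the file, so the copy must be allowed to replace it.
        Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
    }

    // Each call returns an independent stream positioned at the first byte.
    InputStream open() throws IOException {
        return new BufferedInputStream(Files.newInputStream(tempFile));
    }

    // Removes the temp file when the caller is done with all passes.
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }

    public static void main(String[] args) throws IOException {
        InputStream original = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        try (SpooledSource source = new SpooledSource(original)) {
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = source.open()) {
                    // Stand-in for parser.parse(in, ...) or any other consumer.
                    System.out.printf("pass %d read %d bytes%n", pass, in.readAllBytes().length);
                }
            }
        }
    }
}
```

The trade-off is that this always pays for one copy to disk, even for small streams that would fit in memory.
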
On Mon, Mar 1, 2021 at 10:24 AM Peter Kronenberg <[email protected]> wrote:

> After more testing, it seems that it has nothing to do with
> TikaInputStream. I just passed in a BufferedInputStream to the parsers. I
> see that the first thing the AutoDetectParser does is to convert it to a
> TikaInputStream. So maybe TIS is being leveraged at a lower level, but
> there is no reason for me to use the TIS at my level.
>
> But the issue is that different parsers return the stream in different
> states. Sometimes the stream is all used up (although not closed). And
> other times, the stream has been reset to the beginning where it can be
> re-used. Is this expected behavior?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Friday, February 26, 2021 10:03 PM
> *To:* [email protected]
> *Cc:* [email protected]; [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> This email was sent from outside your organisation, yet is displaying the
> name of someone from your organisation. This often happens in phishing
> attempts. Please only interact with this email if you know its source and
> that the content is safe.
>
> But as I said, this doesn’t seem to work with all parsers. So let’s say
> I pass in an MP4 file which uses the MP4Parser and then I want to re-use
> the stream afterward. How can I guarantee consistent behavior, no matter
> which parser gets used?
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Friday, February 26, 2021 3:17 PM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]; [email protected]
> *Subject:* Re: Re-using a TikaStream
>
> The stream.available() call comes from ProxyInputStream. We don't modify
> that in TikaInputStream...maybe we should.
>
> TikaInputStream wraps an incoming InputStream in a BufferedInputStream if
> it doesn't support mark already.
>
> So, as long as you're happy with the performance and potential limitations
> of BufferedInputStream, go with TikaInputStream.
>
> Note that some parsers have to spool to disk. TikaInputStream takes care
> of this for you.
>
> On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <[email protected]> wrote:
>
> I think I figured this out. It seems to depend on what parser is used.
> Not sure if this just has to do with inconsistent implementations, or there
> is some reason behind it.
>
> For most audio files, using the AudioParser, the buffer is still at the
> beginning. Even though there is no text extraction, I would think that
> Tika still needs to read through the stream.
>
> The MP3Parser consumes the stream, but the MP4Parser does not.
>
> The OCR parser also leaves the pointer at the beginning. It definitely
> consumes the stream, so it must be resetting it.
>
> So what is going on? And now I get back to my original question, which
> is, what is the best way to consistently be able to re-use the stream?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Friday, February 26, 2021 12:18 PM
> *To:* [email protected]; [email protected]
> *Cc:* [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> So is this guaranteed, expected behavior?
>
> With a BufferedInputStream, I expect this:
>
>     try (BufferedInputStream stream = new BufferedInputStream(new FileInputStream(file))) {
>         System.out.printf("before - bytes available: %s%n", stream.available());
>         parser.parse(stream, handler, metadata, parseContext);
>         System.out.printf("after - bytes available: %s%n", stream.available());
>     }
>
>     before - bytes available: 10546620
>     after - bytes available: 0
>
> But with a TikaInputStream, I get this:
>
> Note that I’m purposely creating a FileInputStream first in order to hide the
> file information from the TikaInputStream, since in my normal use case, I’m
> dealing with a regular InputStream, not reading from a file.
>
>     try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
>         System.out.printf("before - bytes available: %s, position: %s%n",
>                 stream.available(), stream.getPosition());
>         parser.parse(stream, handler, metadata, parseContext);
>         System.out.printf("after - bytes available: %s, position: %s%n",
>                 stream.available(), stream.getPosition());
>     }
>
>     before - bytes available: 10546620, position: 0
>     after - bytes available: 10546620, position: 0
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Thursday, February 25, 2021 11:28 AM
> *To:* [email protected]; [email protected]
> *Cc:* [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> Or reading from the cloud, either Google or AWS, in which case I also get
> a stream.
> I know what the file name is, but can’t really use it.
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Thursday, February 25, 2021 11:19 AM
> *To:* [email protected]
> *Cc:* [email protected]; [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> With a stream. I am reading arbitrary streams and one of the goals is to
> figure out what it is. So there is no file backing it.
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Thursday, February 25, 2021 11:11 AM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]; [email protected]
> *Subject:* Re: Re-using a TikaStream
>
> Are you initializing with a file or a stream?
>
> On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <[email protected]> wrote:
>
> But how is TikaInputStream allowing me to re-use the stream without me
> doing anything special? Is it automatically spooling to disk as needed?
>
> I wouldn’t say that I can’t afford to spool to disk. I’m just looking for
> the most reasonable solution. I don’t know how big the streams are that
> I’ll be processing. Obviously, if they’re big, then keeping them in memory
> is not reasonable and disk is the only option. But for smaller streams, if
> it can do it all in memory, that’s obviously better. And for my use case,
> I don’t *always* have to re-read the stream.
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Thursday, February 25, 2021 5:48 AM
> *To:* [email protected]
> *Cc:* [email protected]
> *Subject:* Re: Re-using a TikaStream
>
> My $0.02 would be to use TikaInputStream because that gets a lot more use
> and is battle-tested.
> Within the last year or so, we started using RereadableInputStream in one
> of the Microsoft format parsers, so it is also getting some use now.
>
> If you absolutely can't afford to spool to disk, then give
> RereadableInputStream a try.
>
> The InputStreamFactories, in my mind, are somewhat work-arounds for other
> use cases, e.g. retrying/batch etc.
>
> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <[email protected]> wrote:
>
> So this might be moot, because it seems that TikaInputStream is already
> doing some magic and I’m not sure how.
>
> I was able to re-use the stream without doing anything special after a
> call to parse. And in fact, I displayed stream.available() and
> stream.getPosition() before and after the call to parse, and the full stream
> was still available at position 0. What is TikaInputStream doing to make
> this happen?
>
> Just for some additional context, what I’m doing is running the file
> through Tika and then, depending on the file type, I want to do some
> additional non-Tika processing. I thought that once the Tika parse was
> done, the stream would be used up.
>
> What is going on?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Tuesday, February 23, 2021 10:00 AM
> *To:* [email protected]; [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> I just found the RereadableInputStream. This looks more like what I was
> thinking. Is there any reason not to use it? What are the Tika best
> practices? Pros/Cons of each approach?
> If RereadableInputStream works as it’s supposed to, I’m not sure I see the
> advantage of InputStreamFactory.
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, February 22, 2021 8:30 PM
> *To:* [email protected]
> *Cc:* [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> Oh ok. I didn’t realize I needed to write my own class to implement it. I
> was looking for some sort of existing framework.
>
> What is the purpose of the 2 InputStreamFactory classes?
>
> I was re-reading some emails with Nick Burch back around Dec 22-23 and
> maybe I misunderstood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to disk
> to allow re-use.
>
> It seems to me that I need an extra pass through the data in order to save
> to disk. I’m not starting from a File, but from a stream. So if I need to
> read the stream twice, I really have to pass through the data 3 times,
> correct?
>
> Unless there is a way to save to disk during the first pass.
>
> (try/catch removed for simplicity)
>
>     tis = TikaInputStream.get(InputStream);
>     file = tis.getFile();   // <- extra pass
>     tis = TikaInputStream.get(new MyInputStreamFactory(file));
>     // first real pass
>     InputStream is = tis.getInputStreamFactory().getInputStream();
>     // second real pass
>
> *From:* Luís Filipe Nassif <[email protected]>
> *Sent:* Monday, February 22, 2021 5:42 PM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Re-using a TikaStream
>
> Something like:
>
>     class MyInputStreamFactory implements InputStreamFactory {
>
>         private File file;
>
>         public MyInputStreamFactory(File file) {
>             this.file = file;
>         }
>
>         // Returns a new stream over the same file on every call.
>         public InputStream getInputStream() throws IOException {
>             return new FileInputStream(file);
>         }
>     }
>
> in your client code:
>
>     Parser parser = new AutoDetectParser();
>     TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
>     parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());
>
> when you need to reuse the stream (in your parser):
>
>     public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
>                       ParseContext context)
>             throws IOException, SAXException, TikaException {
>         //(...)
>         TikaInputStream tis = TikaInputStream.get(stream);
>         if (tis.hasInputStreamFactory()) {
>             try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
>                 // consume the new stream
>             }
>         } else
>             throw new IOException("not a reusable inputStream");
>     }
>
> Of course this is useful if you are not processing files, e.g. reading
> files from the cloud or sockets.
>
> Regards,
> Luis
>
> On Mon, Feb 22, 2021 at 7:18 PM, Peter Kronenberg <[email protected]> wrote:
>
> I sent this question late on Friday. Sending it again.
> Can you provide a little more information on how to use the
> InputStreamFactory?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Friday, February 19, 2021 5:10 PM
> *To:* [email protected]; [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> There appear to be 2 InputStreamFactory classes: in tika-server-core and
> tika-io. The one in tika-server-core is the only one with a concrete class.
>
> I’m not quite sure I see how to use this.
>
> Normally, I create a TikaInputStream with
> TikaInputStream.get(InputStream). How do I create it from an
> InputStreamFactory?
>
> TikaInputStream.getInputStreamFactory() only returns a factory if the
> TikaInputStream was created from a factory.
>
> Is there a good example of how this is used?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Friday, February 19, 2021 4:57 PM
> *To:* [email protected]; [email protected]
> *Subject:* RE: Re-using a TikaStream
>
> Thanks. I thought that TikaInputStream already automatically saved to
> disk to allow re-reading.
>
> *From:* Luís Filipe Nassif <[email protected]>
> *Sent:* Friday, February 19, 2021 3:44 PM
> *To:* [email protected]
> *Subject:* Re: Re-using a TikaStream
>
> You could call TikaInputStream.getPath() at the beginning of your parser;
> it will spool to a file if the stream is not file based. After consuming the
> original InputStream, create a new one from the temp file created.
>
> If you are using 2.0.0-ALPHA, there is:
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>
> Use with the new methods from TikaInputStream:
>
>     public static TikaInputStream get(InputStreamFactory factory)
>     public InputStreamFactory getInputStreamFactory()
>
> Hope this helps,
> Luis
>
> On Fri, Feb 19, 2021 at 4:09 PM, Peter Kronenberg <[email protected]> wrote:
>
> If I finish parsing a TikaStream, can I re-use the stream (before it is
> closed)? I know you said that there is some magic behind the scenes where
> it spools it to a file. Can I just call reset() to start from the
> beginning?
>
> Peter
>
> *Peter Kronenberg* | *Senior AI Analytic Engineer*
> *C: 703.887.5623*
> 4303 W. 119th St., Leawood, KS 66209
> WWW.TORCH.AI <http://www.torch.ai/>
