Re: Re-using a TikaStream

Tim Allison Mon, 01 Mar 2021 07:31:22 -0800

Detectors should return the stream reset to the beginning.
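
For illustration only, here is a minimal sketch of that "read a little, then rewind" contract using plain java.io mark/reset. It is not Tika's detector code; the class name PeekAndReset and the helper peek are hypothetical, and readNBytes assumes Java 11 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: peeks at leading bytes and rewinds.
final class PeekAndReset {

    /** Reads up to {@code limit} leading bytes, then resets the stream to where it started. */
    static byte[] peek(InputStream stream, int limit) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(limit);                  // remember the current position
        try {
            return stream.readNBytes(limit); // never reads past the mark limit
        } finally {
            stream.reset();                  // hand the stream back rewound
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 example".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            byte[] header = peek(in, 4);
            System.out.println(new String(header, StandardCharsets.US_ASCII));
            // The next read starts again at the first byte.
            System.out.println((char) in.read());
        }
    }
}
```

The caller sees the stream exactly as it was before the peek, which is why a wrapper that supports mark/reset is needed in front of a raw stream.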

Parsers, IIRC, should return the stream fully(?) read but not closed.
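
If you need a second pass regardless of what state a parser leaves the stream in, one option is to spool the stream to a temp file once and open a fresh stream for each pass. The sketch below is an assumption-laden illustration using only java.nio.file, not Tika's own spooling; the class name SpoolForReuse and the helper spool are hypothetical, and the first readAllBytes call stands in for parser.parse(...).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika: copies a stream to disk so it can be re-read.
final class SpoolForReuse {

    /** Copies the whole stream into a new temp file and returns its path. */
    static Path spool(InputStream source) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".bin");
        // createTempFile already created the file, so it must be overwritten.
        Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        InputStream source =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path tmp = spool(source);
        try {
            try (InputStream first = Files.newInputStream(tmp)) {
                first.readAllBytes(); // stand-in for the parse; may consume everything
            }
            try (InputStream second = Files.newInputStream(tmp)) {
                // The second pass is unaffected by how the first stream was left.
                System.out.println(
                        new String(second.readAllBytes(), StandardCharsets.UTF_8));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is one extra copy to disk; the benefit is that the result no longer depends on which parser ran.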


On Mon, Mar 1, 2021 at 10:29 AM Tim Allison <[email protected]> wrote:

> Reusing streams after parsing hasn't been something I've done before...
>
> This is not expected behavior.  Parsers should all behave the same.
>
> On Mon, Mar 1, 2021 at 10:24 AM Peter Kronenberg <
> [email protected]> wrote:
>
>> After more testing, it seems that it has nothing to do with
>> TikaInputStream.  I just passed in a BufferedInputStream to the parsers.  I
>> see that the first thing the AutoDetectParser does is to convert it to a
>> TikaInputStream.  So maybe TIS is being leveraged at a lower level, but
>> there is no reason for me to use the TIS at my level.
>>
>> But the issue is that different parsers return the stream in different
>> states.  Sometimes the stream is all used up (although not closed). And
>> other times, the stream has been re-set to the beginning where it can be
>> re-used.  Is this expected behavior?
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 26, 2021 10:03 PM
>> *To:* [email protected]
>> *Cc:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>>
>>
>> But as I said, this doesn’t seem to work with all parsers.  So let’s
>> say I pass in an MP4 file, which uses the MP4Parser, and then I want to
>> re-use the stream afterward.  How can I guarantee consistent behavior, no
>> matter which parser gets used?
>>
>>
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Friday, February 26, 2021 3:17 PM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]; [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> The stream.available() call comes from ProxyInputStream.  We don't modify
>> that in TikaInputStream...maybe we should.
>>
>>
>>
>> TikaInputStream wraps an incoming InputStream in a BufferedInputStream if
>> it doesn't already support mark/reset.
>>
>>
>>
>> So, as long as you're happy with the performance and potential
>> limitations of BufferedInputStream, go with TikaInputStream.
>>
>>
>>
>> Note that some parsers have to spool to disk.  TikaInputStream takes care
>> of this for you.
>>
>>
>>
>> On Fri, Feb 26, 2021 at 1:01 PM Peter Kronenberg <
>> [email protected]> wrote:
>>
>> I think I figured this out.  It seems to depend on what parser is used.
>> Not sure if this just has to do with inconsistent implementations, or there
>> is some reason behind it.
>>
>>
>>
>> For most audio files, using the AudioParser, the buffer is still at the
>> beginning.  Even though there is no text extraction, I would think that
>> Tika still needs to read through the stream.
>>
>> The MP3Parser consumes the stream, but the MP4Parser does not.
>>
>>
>>
>> The OCR parser also leaves the pointer at the beginning.  It definitely
>> consumes the stream, so it must be resetting it.
>>
>>
>>
>> So what is going on?  And now I get back to my original question, which
>> is: what is the best way to consistently be able to re-use the stream?
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 26, 2021 12:18 PM
>> *To:* [email protected]; [email protected]
>> *Cc:* [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> So is this guaranteed, expected behavior?
>>
>>
>>
>> With a BufferedInputStream – I expect this
>>
>>
>>
>>
>> try (BufferedInputStream stream =
>>         new BufferedInputStream(new FileInputStream(file))) {
>>     System.out.printf("before - bytes available: %s%n", stream.available());
>>     parser.parse(stream, handler, metadata, parseContext);
>>     System.out.printf("after - bytes available: %s%n", stream.available());
>> }
>>
>>
>>
>> before - bytes available: 10546620
>>
>> after - bytes available: 0
>>
>>
>>
>>
>>
>>
>>
>> But with a TikaInputStream, I get this
>>
>>
>>
>> Note that I’m purposely creating a FileInputStream first in order to hide
>> the file information from the TikaInputStream, since in my normal use case
>> I’m dealing with a regular InputStream, not reading from a file.
>>
>>
>> try (TikaInputStream stream =
>>         TikaInputStream.get(new FileInputStream(file))) {
>>     System.out.printf("before - bytes available: %s, position: %s%n",
>>             stream.available(), stream.getPosition());
>>     parser.parse(stream, handler, metadata, parseContext);
>>     System.out.printf("after - bytes available: %s, position: %s%n",
>>             stream.available(), stream.getPosition());
>> }
>>
>>
>>
>> before - bytes available: 10546620, position: 0
>>
>> after - bytes available: 10546620, position: 0
>>
>>
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Thursday, February 25, 2021 11:28 AM
>> *To:* [email protected]; [email protected]
>> *Cc:* [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> Or reading from the cloud, either Google or AWS, in which case I also get
>> a stream.   I know what the file name is, but can’t really use it
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Thursday, February 25, 2021 11:19 AM
>> *To:* [email protected]
>> *Cc:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> With a stream.  I am reading arbitrary streams and one of the goals is to
>> figure out what it is. So there is no file backing it.
>>
>>
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Thursday, February 25, 2021 11:11 AM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]; [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> Are you initializing with a file or a stream?
>>
>>
>>
>> On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <
>> [email protected]> wrote:
>>
>> But how is TikaInputStream allowing me to re-use the stream without me
>> doing anything special?   Is it automatically spooling to disk as needed?
>>
>>
>>
>> I wouldn’t say that I can’t afford to spool to disk.  I’m just looking
>> for the most reasonable solution.  I don’t know how big the streams are
>> that I’ll be processing.  Obviously, if they’re big, then keeping them in
>> memory is not reasonable and disk is the only option.  But for smaller
>> streams, if it can do it all in memory, that’s obviously better.  And for
>> my use case, I don’t **always** have to re-read the stream.
>>
>>
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Thursday, February 25, 2021 5:48 AM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> My $0.02 would be to use TikaInputStream because that gets a lot more use
>> and is battle-tested.  Within the last year or so, we started using
>> RereadableInputStream in one of the Microsoft format parsers so it is also
>> getting some use now.
>>
>>
>>
>> If you absolutely can't afford to spool to disk, then give
>> RereadableInputStream a try.
>>
>>
>>
>> The InputStreamFactory classes, in my mind, are more of a workaround for
>> other use cases, e.g. retrying, batch processing, etc.
>>
>>
>>
>> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <
>> [email protected]> wrote:
>>
>> So this might be moot, because it seems that TikaInputStream is already
>> doing some magic and I’m not sure how.
>>
>> I was able to re-use the stream without doing anything special after a
>> call to parse.  And in fact, I displayed stream.available() and
>> stream.getPosition() before and after the call to parse, and the full
>> stream was still available at position 0.  What is TikaInputStream doing to
>> make this happen?
>>
>>
>>
>> Just for some additional context, what I’m doing is running the file
>> through Tika and then, depending on the file type, I want to do some
>> additional non-Tika processing.  I thought that once the Tika parse was
>> done, the stream would be used up.
>>
>>
>>
>> What is going on?
>>
>>
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Tuesday, February 23, 2021 10:00 AM
>> *To:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> I just found the RereadableInputStream.  This looks more like what I was
>> thinking.  Is there any reason not to use it?  What are the Tika best
>> practices?  Pros/Cons of each approach?  If RereadableInputStream works as
>> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Monday, February 22, 2021 8:30 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> Oh, OK.  I didn’t realize I needed to write my own class to implement it.
>> I was looking for some sort of existing framework.
>>
>>
>>
>> What is the purpose of the 2 InputStreamFactory classes?
>>
>>
>>
>> I was re-reading some emails with Nick Burch from back around Dec 22-23 and
>> maybe I misunderstood him, but it sounds like he was saying that
>> TikaInputStream was smart enough to automatically spool the stream to disk
>> to allow re-use.
>>
>>
>>
>> It seems to me that I need an extra pass through the data in order to
>> save to disk.  I’m not starting from a File, but from a stream.  So if I
>> need to read the stream twice, I really have to pass through the data 3
>> times, correct?
>>
>> Unless there is a way to save to disk during the first pass
>>
>>
>>
>> (try/catch removed for simplicity)
>>
>>
>>
>> tis = TikaInputStream.get(inputStream);
>>
>> file = tis.getFile();   // <- extra pass
>>
>> tis = TikaInputStream.get(new MyInputStreamFactory(file));
>>
>> // first real pass
>>
>> InputStream is = tis.getInputStreamFactory().getInputStream();
>>
>> // second real pass
>>
>>
>>
>>
>>
>>
>>
>> *From:* Luís Filipe Nassif <[email protected]>
>> *Sent:* Monday, February 22, 2021 5:42 PM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> Something like:
>>
>>
>>
>> class MyInputStreamFactory implements InputStreamFactory {
>>
>>     private final File file;
>>
>>     public MyInputStreamFactory(File file) {
>>         this.file = file;
>>     }
>>
>>     // Opens a fresh stream over the same file on every call.
>>     @Override
>>     public InputStream getInputStream() throws IOException {
>>         return new FileInputStream(file);
>>     }
>> }
>>
>>
>>
>> in your client code:
>>
>>
>>
>> Parser parser = new AutoDetectParser();
>>
>> TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
>>
>> parser.parse(tis, new ToTextContentHandler(), new Metadata(),
>>         new ParseContext());
>>
>>
>>
>> When you need to reuse the stream (inside your parser):
>>
>>
>>
>> public void parse(InputStream stream, ContentHandler handler,
>>                   Metadata metadata, ParseContext context)
>>         throws IOException, SAXException, TikaException {
>>
>>     //(...)
>>
>>     TikaInputStream tis = TikaInputStream.get(stream);
>>
>>     if (tis.hasInputStreamFactory()) {
>>         // Ask the factory for a fresh stream instead of rewinding the old one.
>>         try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
>>             // consume the new stream
>>         }
>>     } else {
>>         throw new IOException("not a reusable inputStream");
>>     }
>> }
>>
>>
>>
>> Of course this is useful if you are not processing files, e.g. reading
>> files from the cloud or sockets.
>>
>>
>>
>> Regards,
>>
>> Luis
>>
>>
>>
>>
>>
>> On Mon, Feb 22, 2021 at 7:18 PM Peter Kronenberg <
>> [email protected]> wrote:
>>
>> I sent this question late on Friday.  Sending it again.  Can you provide
>> a little more information on how to use the InputStreamFactory?
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 19, 2021 5:10 PM
>> *To:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> There appear to be 2 InputStreamFactory classes: in tika-server-core and
>> tika-io.  The one in tika-server-core is the only one with a concrete class.
>>
>> I’m not quite sure I see how to use this.
>>
>> Normally, I create a TikaInputStream with
>> TikaInputStream.get(InputStream).  How do I create it from an
>> InputStreamFactory?
>>
>> TikaInputStream.getInputStreamFactory() only returns a factory if the
>> TikaInputStream was created from a factory.
>>
>> Is there a good example of how this is used?
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Friday, February 19, 2021 4:57 PM
>> *To:* [email protected]; [email protected]
>> *Subject:* RE: Re-using a TikaStream
>>
>>
>>
>> Thanks.  I thought that TikaInputStream already automatically saved to
>> disk to allow re-reading.
>>
>>
>>
>> *From:* Luís Filipe Nassif <[email protected]>
>> *Sent:* Friday, February 19, 2021 3:44 PM
>> *To:* [email protected]
>> *Subject:* Re: Re-using a TikaStream
>>
>>
>>
>> You could call TikaInputStream.getPath() at the beginning of your parser,
>> it will spool to file if not file based. After consuming the original
>> inputStream, create a new one from the temp file created.
>>
>>
>>
>> If you are using 2.0.0-ALPHA, there is:
>>
>>
>>
>>
>> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>>
>>
>>
>> Use with the new methods from TikaInputStream:
>>
>> public static TikaInputStream get(InputStreamFactory factory)
>>
>> public InputStreamFactory getInputStreamFactory()
>>
>>
>>
>> Hope this helps,
>>
>> Luis
>>
>>
>>
>> On Fri, Feb 19, 2021 at 4:09 PM Peter Kronenberg <
>> [email protected]> wrote:
>>
>> If I finish parsing a TikaStream, can I re-use the stream (before it is
>> closed)?  I know you said that there is some magic behind the scenes where
>> it spools it to a file.  Can I just call reset() to start from the
>> beginning?
>>
>>
>>
>> Peter
>>
>>
>>
