So is this guaranteed, expected behavior?
With a BufferedInputStream, I get what I expect:
try (BufferedInputStream stream =
        new BufferedInputStream(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s%n", stream.available());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s%n", stream.available());
}
before - bytes available: 10546620
after - bytes available: 0
But with a TikaInputStream, I get this.
Note that I’m purposely creating a FileInputStream first in order to hide the
file information from the TikaInputStream, since in my normal use case I’m
dealing with a regular InputStream, not reading from a file.
try (TikaInputStream stream = TikaInputStream.get(new FileInputStream(file))) {
    System.out.printf("before - bytes available: %s, position: %s%n",
            stream.available(), stream.getPosition());
    parser.parse(stream, handler, metadata, parseContext);
    System.out.printf("after - bytes available: %s, position: %s%n",
            stream.available(), stream.getPosition());
}
before - bytes available: 10546620, position: 0
after - bytes available: 10546620, position: 0
From: Peter Kronenberg <[email protected]>
Sent: Thursday, February 25, 2021 11:28 AM
To: [email protected]; [email protected]
Cc: [email protected]
Subject: RE: Re-using a TikaStream
This email was sent from outside your organisation, yet is displaying the name
of someone from your organisation. This often happens in phishing attempts.
Please only interact with this email if you know its source and that the
content is safe.
Or reading from the cloud, either Google or AWS, in which case I also get a
stream. I know what the file name is, but can’t really use it.
From: Peter Kronenberg <[email protected]>
Sent: Thursday, February 25, 2021 11:19 AM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: RE: Re-using a TikaStream
With a stream. I am reading arbitrary streams and one of the goals is to
figure out what it is. So there is no file backing it.
From: Tim Allison <[email protected]>
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]; [email protected]
Subject: Re: Re-using a TikaStream
Are you initializing with a file or a stream?
On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <[email protected]> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing
anything special? Is it automatically spooling to disk as needed?
I wouldn’t say that I can’t afford to spool to disk. I’m just looking for the
most reasonable solution. I don’t know how big the streams are that I’ll be
processing. Obviously, if they’re big, then keeping them in memory is not
reasonable and disk is the only option. But for smaller streams, if it can do
it all in memory, that’s obviously better. And for my use case, I don’t
*always* have to re-read the stream.
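The trade-off described above (keep small streams in memory, fall back to disk for large ones) can be sketched with JDK classes only. This is an illustration of the idea, not Tika code and not a description of how TikaInputStream or RereadableInputStream is implemented; the class name `HybridSpool`, its methods, and the threshold value are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: buffers a stream in memory up to a threshold, else spills to a temp file. */
final class HybridSpool implements AutoCloseable {
    private final byte[] memory; // non-null when the data fit in memory
    private final Path file;     // non-null when the data were spilled to disk

    private HybridSpool(byte[] memory, Path file) {
        this.memory = memory;
        this.file = file;
    }

    /** Reads the source exactly once; memory use may overshoot the threshold by one chunk. */
    static HybridSpool of(InputStream in, int maxInMemory) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
            if (buf.size() > maxInMemory) {
                // Too big: move what we have to disk, then stream the rest straight to the file.
                Path tmp = Files.createTempFile("hybrid-", ".tmp");
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    buf.writeTo(out);
                    in.transferTo(out);
                }
                return new HybridSpool(null, tmp);
            }
        }
        return new HybridSpool(buf.toByteArray(), null);
    }

    boolean inMemory() {
        return memory != null;
    }

    /** Each call returns a fresh, independent stream positioned at the start. */
    InputStream open() throws IOException {
        return inMemory() ? new ByteArrayInputStream(memory) : Files.newInputStream(file);
    }

    @Override
    public void close() throws IOException {
        if (file != null) {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000];
        // 64 KiB is an arbitrary illustrative threshold.
        try (HybridSpool spool = HybridSpool.of(new ByteArrayInputStream(data), 64 * 1024)) {
            System.out.println("in memory: " + spool.inMemory());
            try (InputStream first = spool.open()) {
                System.out.println("pass 1: " + first.readAllBytes().length + " bytes");
            }
            try (InputStream second = spool.open()) {
                System.out.println("pass 2: " + second.readAllBytes().length + " bytes");
            }
        }
    }
}
```

The cost of this approach is that the source is read in full up front, even if a second pass turns out not to be needed.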
From: Tim Allison <[email protected]>
Sent: Thursday, February 25, 2021 5:48 AM
To: [email protected]
Cc: [email protected]
Subject: Re: Re-using a TikaStream
My $0.02 would be to use TikaInputStream because that gets a lot more use and
is battle-tested. Within the last year or so, we started using
RereadableInputStream in one of the Microsoft format parsers so it is also
getting some use now.
If you absolutely can't afford to spool to disk, then give
RereadableInputStream a try.
The InputStreamFactory classes, in my mind, are more of a workaround for other
use cases, e.g. retrying, batch processing, etc.
On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <[email protected]> wrote:
So this might be moot, because it seems that TikaInputStream is already doing
some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to
parse. And in fact, I displayed stream.available() and stream.getPosition()
before and after the call to parse, and the full stream was still available at
position 0. What is TikaInputStream doing to make this happen?
Just for some additional context, what I’m doing is running the file through
Tika and then, depending on the file type, I want to do some additional
non-Tika processing. I thought that once the Tika parse was done, the stream
would be used up.
What is going on?
From: Peter Kronenberg <[email protected]>
Sent: Tuesday, February 23, 2021 10:00 AM
To: [email protected]; [email protected]
Subject: RE: Re-using a TikaStream
I just found the RereadableInputStream. This looks more like what I was
thinking. Is there any reason not to use it? What are the Tika best
practices? Pros/Cons of each approach? If RereadableInputStream works as it’s
supposed to, I’m not sure I see the advantage of InputStreamFactory.
From: Peter Kronenberg <[email protected]>
Sent: Monday, February 22, 2021 8:30 PM
To: [email protected]
Cc: [email protected]
Subject: RE: Re-using a TikaStream
Oh ok. I didn’t realize I needed to write my own class to implement it. I was
looking for some sort of existing framework.
What is the purpose of the 2 InputStreamFactory classes?
I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I
misunderstood him, but it sounds like he was saying that TikaInputStream was
smart enough to automatically spool the stream to disk to allow re-use.
It seems to me that I need an extra pass through the data in order to save to
disk. I’m not starting from a File, but from a stream. So if I need to read
the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass
(try/catch removed for simplicity)
TikaInputStream tis = TikaInputStream.get(inputStream);
File file = tis.getFile();   // <== extra pass
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass
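One generic way to save to disk during the first pass is a "tee" stream that copies every byte to a file as the first consumer reads it. The sketch below uses only JDK classes and is not part of Tika; `TeeToFile` is a hypothetical name. It assumes the first consumer reads the stream to the end, otherwise the copy on disk is incomplete.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: copies every byte that is read (or skipped) to a second stream. */
final class TeeToFile extends FilterInputStream {
    private final OutputStream copy;

    TeeToFile(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            copy.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n);
        }
        return n;
    }

    /** Skips by reading, so skipped bytes still reach the copy. */
    @Override
    public long skip(long n) throws IOException {
        byte[] buf = new byte[8192];
        long skipped = 0;
        while (skipped < n) {
            int r = read(buf, 0, (int) Math.min(buf.length, n - skipped));
            if (r < 0) {
                break;
            }
            skipped += r;
        }
        return skipped;
    }

    // mark/reset would write the same bytes to the copy twice, so it is disabled.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            copy.close();
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        Path tmp = Files.createTempFile("tee-", ".tmp");
        try {
            // First real pass: the consumer reads, and the bytes land on disk as a side effect.
            try (InputStream first = new TeeToFile(source, Files.newOutputStream(tmp))) {
                System.out.println("pass 1: " + first.readAllBytes().length + " bytes");
            }
            // Second real pass: read the copy.
            try (InputStream second = Files.newInputStream(tmp)) {
                System.out.println("pass 2: " + second.readAllBytes().length + " bytes");
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this arrangement the data is traversed twice rather than three times, at the price of always writing the copy even when no second pass happens.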
From: Luís Filipe Nassif <[email protected]>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]
Subject: Re: Re-using a TikaStream
Something like:
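class MyInputStreamFactory implements InputStreamFactory {
    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    // new FileInputStream(file) can throw FileNotFoundException (an IOException)
    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}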
In your client code:
Parser parser = new AutoDetectParser();
TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(),
        new ParseContext());
When you need to reuse the stream (in your parser):
public void parse(InputStream stream, ContentHandler handler,
                  Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    // (...)
    TikaInputStream tis = TikaInputStream.get(stream);
    if (tis.hasInputStreamFactory()) {
        try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
            // consume the new stream
        }
    } else {
        throw new IOException("not a reusable inputStream");
    }
}
Of course this is useful if you are not processing files, e.g. reading files
from the cloud or sockets.
Regards,
Luis
On Mon, Feb 22, 2021 at 7:18 PM Peter Kronenberg <[email protected]> wrote:
I sent this question late on Friday. Sending it again. Can you provide a
little more information on how to use the InputStreamFactory?
From: Peter Kronenberg <[email protected]>
Sent: Friday, February 19, 2021 5:10 PM
To: [email protected]; [email protected]
Subject: RE: Re-using a TikaStream
There appear to be 2 InputStreamFactory classes: one in tika-server-core and
one in tika-io. The one in tika-server-core is the only one with a concrete
class.
I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).
How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the
TikaInputStream was created from a factory.
Is there a good example of how this is used?
From: Peter Kronenberg <[email protected]>
Sent: Friday, February 19, 2021 4:57 PM
To: [email protected]; [email protected]
Subject: RE: Re-using a TikaStream
Thanks. I thought that TikaInputStream already automatically saved to disk to
allow re-reading.
From: Luís Filipe Nassif <[email protected]>
Sent: Friday, February 19, 2021 3:44 PM
To: [email protected]
Subject: Re: Re-using a TikaStream
You could call TikaInputStream.getPath() at the beginning of your parser; it
will spool the stream to a temp file if it is not already file-based. After
consuming the original InputStream, create a new one from that temp file.
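The spool-then-reopen pattern can be illustrated with plain JDK calls. This sketch does not use Tika and only stands in for the step that getPath() is described as performing; `SpoolOnce`, `spoolToTempFile`, and the temp-file prefix are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SpoolOnce {

    /** Copies the whole stream to a new temp file and returns the file's path. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        // createTempFile already created the file, so the copy must be allowed to replace it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        Path tmp = spoolToTempFile(source);
        try {
            // Every pass opens its own independent stream over the spooled copy.
            try (InputStream first = Files.newInputStream(tmp)) {
                System.out.println("pass 1: " + first.readAllBytes().length + " bytes");
            }
            try (InputStream second = Files.newInputStream(tmp)) {
                System.out.println("pass 2: " + second.readAllBytes().length + " bytes");
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The caller owns the temp file here and has to delete it when done.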
If you are using 2.0.0-ALPHA, there is:
https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
Use with the new methods from TikaInputStream:
public static TikaInputStream get(InputStreamFactory factory)
public InputStreamFactory getInputStreamFactory()
Hope this helps,
Luis
On Fri, Feb 19, 2021 at 4:09 PM Peter Kronenberg <[email protected]> wrote:
If I finish parsing a TikaStream, can I re-use the stream (before it is
closed)? I know you said that there is some magic behind the scenes where it
spools it to a file. Can I just call reset() to start from the beginning?
Peter
Peter Kronenberg | Senior AI Analytic ENGINEER