Let me preface my remarks by saying that I'm mystified as to how to
use ContentHandler to do anything complicated.
It seems like the semantics of getting the content out of a
ContentHandler are wrong, or at least shortsighted. The user has two
options for using the text provided by ContentHandler. The user
can provide an OutputStream, which ContentHandler will write() the
bytes to as it reads the InputStream associated with the file, or
the user can have ContentHandler buffer the entire parsed contents of
the file in memory and then get back a humongous String via
ContentHandler.toString().
There needs to be a better way.
Writing the bytes to an OutputStream pretty much locks the bytes up so
that the only thing you can do is write them to some sort of device,
whether it's the console, disk, or a network connection. Buffering
the entire file is simply not an option for very large files. For
very large files, you need to process chunks of the file, like from a
stream, or better yet, a series of callbacks with a relatively small
buffer (say even a few megs). (This is how SAX does it.) By using a
callback system, the user is free to do whatever they want to do
with each chunk. If they want to blast it to the disk, a simple
OutputStream.write(buf) is good enough. If they want to do some more
parsing of the text (like I want to do), then they can do that as well
without reading the entire file into memory.
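A minimal sketch of the callback idea described above, using only the
JDK's SAX classes. ChunkingTextHandler, its chunkSize parameter, and
the Consumer<String> sink are hypothetical names made up for
illustration; none of them are part of Tika or SAX. The handler
collects characters() events into a bounded buffer and hands each
full buffer to the caller's callback, so at most chunkSize characters
are held at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards SAX character data to a callback in bounded chunks. */
class ChunkingTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int chunkSize;          // maximum number of chars buffered at once
    private final Consumer<String> sink;  // caller-supplied callback, invoked once per chunk

    ChunkingTextHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int offset = start;
        int remaining = length;
        while (remaining > 0) {
            // Copy only as much as still fits into the current chunk.
            int n = Math.min(chunkSize - buffer.length(), remaining);
            buffer.append(ch, offset, n);
            offset += n;
            remaining -= n;
            if (buffer.length() == chunkSize) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        flush(); // hand over whatever is left in the last, partial chunk
    }

    private void flush() {
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}

class ChunkingDemo {
    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        ChunkingTextHandler handler = new ChunkingTextHandler(4, chunks::add);
        char[] text = "hello world".toCharArray();
        handler.characters(text, 0, text.length); // simulate one SAX event
        handler.endDocument();
        System.out.println(chunks);
    }
}
```

With Tika, the assumption would be that such a handler can be wrapped
in BodyContentHandler's constructor that takes another ContentHandler
and then passed to the parser's parse() call; that should be checked
against the javadoc for the Tika version in use. Note that a fixed
chunk size can split a surrogate pair or a word across two chunks, so
the callback has to tolerate arbitrary cut points.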
Here's my scenario that prompted this email:
I'm reading a bunch of files of a variety of types. Some of these
files can be quite large. Like gigabytes. I'm using AutoDetectParser
to handle the appropriate parsing and BodyContentHandler to extract
the plaintext. I want to take the extracted plaintext, do some
analysis on it, and then index the plaintext along with the results of
my analysis. Specifically, my analysis requires taking the extracted
plaintext, segmenting it into sentences, and doing part-of-speech
tagging and morphological analysis (i.e., stemming) via an external
process. This means I can't use an OutputStream, since you can't read
from an OutputStream, so I'm stuck with using
ContentHandler.toString(), which can (and does) exhaust memory for
large files.
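To make the scenario concrete, here is a rough sketch of sentence
segmentation over text that arrives in pieces rather than as one big
String. StreamingSentenceSplitter and its finish() method are
hypothetical names for illustration; the segmentation itself uses the
JDK's java.text.BreakIterator, and English text is assumed. It only
emits a sentence once some text following it has been seen, because
the end of a chunk may fall in the middle of a sentence.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.function.Consumer;

/** Hypothetical helper: accepts text chunks and emits complete sentences. */
class StreamingSentenceSplitter implements Consumer<String> {
    private final StringBuilder pending = new StringBuilder(); // text not yet emitted
    private final BreakIterator sentences =
            BreakIterator.getSentenceInstance(Locale.ENGLISH);  // assumption: English text
    private final Consumer<String> sentenceSink;

    StreamingSentenceSplitter(Consumer<String> sentenceSink) {
        this.sentenceSink = sentenceSink;
    }

    @Override
    public void accept(String chunk) {
        pending.append(chunk);
        String text = pending.toString();
        sentences.setText(text);
        int start = sentences.first();
        int end = sentences.next();
        // The final boundary is always end-of-text, which may be mid-sentence,
        // so the last segment is held back until more text (or finish()) arrives.
        while (end != BreakIterator.DONE && end < text.length()) {
            emit(text.substring(start, end));
            start = end;
            end = sentences.next();
        }
        pending.delete(0, start); // keep only the unfinished tail
    }

    /** Call once after the last chunk to emit the remaining tail. */
    void finish() {
        emit(pending.toString());
        pending.setLength(0);
    }

    private void emit(String sentence) {
        String trimmed = sentence.trim();
        if (!trimmed.isEmpty()) {
            sentenceSink.accept(trimmed);
        }
    }
}

class SplitterDemo {
    public static void main(String[] args) {
        StreamingSentenceSplitter splitter =
                new StreamingSentenceSplitter(System.out::println);
        splitter.accept("Hello wor");
        splitter.accept("ld. This is a te");
        splitter.accept("st. Bye");
        splitter.finish();
    }
}
```

Each emitted sentence could then be written to the external tagger's
input as it appears. Memory use is bounded by the longest stretch of
text without a detected sentence boundary, not by the file size, and
BreakIterator's rules are heuristic (abbreviations such as "Mr." can
produce false breaks).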
What I really want is for someone to tell me how to get back a usable
stream of plaintext. Whether this involves a radical change to Tika's
ContentHandler class or some trick with Java, I really don't care, as
long as it's single thread safe. (Java's PipedInputStream and
PipedOutputStream are not single thread safe; using both ends from
one thread can deadlock.)
I know I can't be the only one that's had or will have this problem. It
really seems like this use case needs to be handled, because the use
case that Tika currently seems to be designed for is "Write plaintext
to the disk."
Thanks.
--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/