Let me preface my remarks by saying that I'm mystified as to how to use ContentHandler to do anything complicated.

It seems like the semantics of getting the content out of a ContentHandler are wrong, or at least shortsighted. The user has two options for using the text provided by ContentHandler. The user can provide an OutputStream, which ContentHandler will write() the bytes to as it reads the InputStream associated with the file, or the user can have ContentHandler buffer the entire parsed contents of the file in memory and then get back a humongous String via ContentHandler.toString().

There needs to be a better way.

Writing the bytes to an OutputStream pretty much locks the bytes up so that the only thing you can do is write them to some sort of device, whether it's the console, disk, or a network connection. Buffering the entire file is simply not an option for very large files. For very large files, you need to process the file in chunks, either by reading from a stream or, better yet, through a series of callbacks with a relatively small buffer (say, even a few megabytes). (This is how SAX does it.) With a callback system, the user is free to do whatever he/she wants with each chunk. If he/she wants to blast it to the disk, a simple OutputStream.write(buf) is good enough. If he/she wants to do some more parsing of the text (like I want to do), then he/she can do that as well without reading the entire file into memory.
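To make the callback idea concrete, here is a minimal sketch of the kind of handler I have in mind. It uses only the JDK's SAX classes. The class name ChunkingHandler, the chunk size, and the Consumer<String> callback are all made up for illustration; none of this is part of Tika.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical helper (not a Tika class): collects SAX character events
 * into fixed-size chunks and hands each full chunk to a callback, so the
 * whole document never has to sit in memory at once.
 */
class ChunkingHandler extends DefaultHandler {
    private final char[] buffer;
    private final Consumer<String> sink;
    private int used = 0;

    ChunkingHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.buffer = new char[chunkSize];
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the incoming text into the buffer, flushing whenever it fills.
        while (length > 0) {
            int n = Math.min(length, buffer.length - used);
            System.arraycopy(ch, start, buffer, used, n);
            used += n;
            start += n;
            length -= n;
            if (used == buffer.length) {
                flush();
            }
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Treat ignorable whitespace as ordinary text so word breaks survive.
        characters(ch, start, length);
    }

    @Override
    public void endDocument() {
        // Emit whatever is left over as a final, possibly shorter, chunk.
        flush();
    }

    private void flush() {
        if (used > 0) {
            sink.accept(new String(buffer, 0, used));
            used = 0;
        }
    }
}
```

Everything runs on the caller's thread: the parser calls characters(), and the callback runs inside that call. I believe BodyContentHandler has a constructor that wraps another ContentHandler, so something like new BodyContentHandler(new ChunkingHandler(1 << 20, chunk -> analyze(chunk))) might be enough, with analyze() standing in for whatever the user wants to do; that constructor is an assumption to check against the Tika version in use. Note that a chunk boundary can fall in the middle of a sentence (or even a surrogate pair), so the callback has to carry any leftover text over to the next chunk.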

Here's my scenario that prompted this email:

I'm reading a bunch of files of a variety of types. Some of these files can be quite large. Like gigabytes. I'm using AutoDetectParser to handle the appropriate parsing and BodyContentHandler to extract the plaintext. I want to take the extracted plaintext, do some analysis on it, and then index the plaintext along with the results of my analysis. Specifically, my analysis requires taking the extracted plaintext, segmenting it into sentences, and doing part-of-speech tagging and morphological analysis (i.e. stemming) via an external process. This means I can't use an OutputStream, since you can't read from an OutputStream, so I'm stuck with using ContentHandler.toString(), which can (and does) exhaust memory for large files.

What I really want is for someone to tell me how to get back a usable stream of plaintext. Whether this involves a radical change to Tika's ContentHandler class or some trick with Java, I really don't care, as long as it's safe to use from a single thread. (Java's PipedInputStream and PipedOutputStream are not safe to use from a single thread.)

I know I can't be the only one who has had or will have this problem. It really seems like this use case needs to be handled, because the use case that Tika currently seems to be designed for is "Write plaintext to the disk."

Thanks.

--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/

