ContentHandler's OutputStream

Jonathan Koren Wed, 04 Feb 2009 18:02:34 -0800

Let me preface my remarks by saying that I'm mystified as to how to use ContentHandler to do anything complicated.

It seems like the semantics of getting the content out of a ContentHandler are wrong, or at least shortsighted. The user has two options for using the text provided by ContentHandler. The user can provide an OutputStream, which ContentHandler will write() the bytes to as it reads the InputStream associated with the file, or the user can have ContentHandler buffer the entire parsed contents of the file in memory and then get back a humongous String via ContentHandler.toString().


There needs to be a better way.

Writing the bytes to an OutputStream pretty much locks the bytes up so that the only thing you can do is write them to some sort of device, whether it's the console, disk, or a network connection. Buffering the entire file is simply not an option for very large files. For very large files, you need to process chunks of the file, like from a stream, or better yet, through a series of callbacks with a relatively small buffer (say even a few megs). (This is how SAX does it.) By using a callback system, the user is free to do whatever he/she wants to do with each chunk. If he/she wants to blast it to the disk, a simple OutputStream.write(buf) is good enough. If he/she wants to do some more parsing of the text (like I want to do), then he/she can do that as well without reading the entire file into memory.
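
To make the callback idea concrete, here is a minimal sketch using only the JDK's SAX classes. ChunkCallbackHandler, its chunk size, and the Consumer<String> callback are hypothetical names invented for illustration; nothing here is part of Tika. The handler collects character events into a bounded buffer and passes each full chunk to the callback, so memory use stays near the chunk size no matter how large the file is.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: passes SAX character data to a callback in bounded chunks. */
class ChunkCallbackHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int chunkSize;
    private final Consumer<String> callback;

    ChunkCallbackHandler(int chunkSize, Consumer<String> callback) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.callback = callback;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int offset = start;
        int remaining = length;
        while (remaining > 0) {
            // Copy only as much as still fits in the current chunk.
            int n = Math.min(remaining, chunkSize - buffer.length());
            buffer.append(ch, offset, n);
            offset += n;
            remaining -= n;
            if (buffer.length() == chunkSize) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        // Hand over whatever is left in the final, partial chunk.
        flush();
    }

    private void flush() {
        if (buffer.length() > 0) {
            callback.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

With a callback such as chunk -> out.write(chunk.getBytes(charset)) this reproduces the "write it to disk" case, and any other per-chunk processing fits in the same slot. Whether such a handler can be plugged into Tika's BodyContentHandler is an assumption that would need checking against the Tika API.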


Here's my scenario that prompted this email:

I'm reading a bunch of files of a variety of types. Some of these files can be quite large. Like gigabytes. I'm using AutoDetectParser to handle the appropriate parsing and BodyContentHandler to extract the plaintext. I want to take the extracted plaintext, do some analysis on it, and then index the plaintext along with the results of my analysis. Specifically, my analysis requires taking the extracted plaintext, segmenting it into sentences, and doing part-of-speech tagging and morphological analysis (i.e. stemming) via an external process. This means I can't use an OutputStream, since you can't read from an OutputStream, so I'm stuck with using ContentHandler.toString(), which can (and does) exhaust memory for large files.
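
As a rough illustration of why chunked delivery would be enough for this scenario, the sketch below assembles whole sentences from text that arrives in arbitrary pieces. SentenceAssembler is a hypothetical helper, and splitting on '.', '!' and '?' is a deliberately naive stand-in for a real sentence segmenter (it mishandles abbreviations and ellipses). The point is only that a partial sentence can be carried over from one chunk to the next, so the buffer never holds more than the longest sentence.

```java
import java.util.function.Consumer;

/** Hypothetical helper: turns a sequence of text chunks into whole sentences. */
class SentenceAssembler implements Consumer<String> {
    private final StringBuilder pending = new StringBuilder();
    private final Consumer<String> sentenceSink;

    SentenceAssembler(Consumer<String> sentenceSink) {
        this.sentenceSink = sentenceSink;
    }

    @Override
    public void accept(String chunk) {
        pending.append(chunk);
        int start = 0;
        for (int i = 0; i < pending.length(); i++) {
            char c = pending.charAt(i);
            // Naive rule: a terminator character ends the current sentence.
            if (c == '.' || c == '!' || c == '?') {
                emit(pending.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Keep only the unfinished tail for the next chunk.
        pending.delete(0, start);
    }

    /** Call once after the last chunk to emit any trailing text. */
    void finish() {
        emit(pending.toString());
        pending.setLength(0);
    }

    private void emit(String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            sentenceSink.accept(trimmed);
        }
    }
}
```

Each completed sentence could then be handed to the external tagger and the indexer as it appears, all on the calling thread.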

What I really want is for someone to tell me how to get back a usable stream of plaintext. Whether this involves a radical change to Tika's ContentHandler class or some trick with Java, I really don't care, as long as it's single-thread safe. (Java's PipedInputStream and PipedOutputStream are not single-thread safe.)

I know I can't be the only one who has had or will have this problem. It really seems like this use case needs to be handled, because the use case that Tika currently seems to be designed for is "Write plaintext to the disk."


Thanks.

--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/
