Hi,

On Thu, Feb 5, 2009 at 3:02 AM, Jonathan Koren <jonat...@soe.ucsc.edu> wrote:
> What I really want is someone to tell me how to get back a usable stream of
> plaintext, whether this involves a radical change to Tika's ContentHandler
> class or some trick with Java, I really don't care, as long as it's single
> thread save.

Have you looked at the ParsingReader class? It seems like a perfect match
for your needs. ParsingReader starts a background thread to do the parsing
and pipes the output to you, so you control when and how you read the
extracted text.

Alternatively, if the extra thread is not acceptable, you could implement a
custom ContentHandler that directly catches and processes the characters()
and ignorableWhitespace() events. Or you could subclass Writer and treat the
write() calls as callbacks from the parser.

BR,

Jukka Zitting
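
P.S. A minimal sketch of the two single-threaded options, in case it helps. The class names (TextCallbackHandler, CallbackWriter, TextExtractionSketch) are made up for illustration and are not part of Tika. To keep the example runnable without Tika on the classpath, it drives the handler with the JDK's own SAX parser; the assumption is that with Tika you would pass the same handler to the parser's parse() method instead.

```java
import java.io.StringReader;
import java.io.Writer;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Option 1 (hypothetical name): forwards SAX text events to a callback. */
class TextCallbackHandler extends DefaultHandler {
    private final Consumer<String> sink;

    TextCallbackHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called on the parsing thread, so no extra thread is involved.
        sink.accept(new String(ch, start, length));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        sink.accept(new String(ch, start, length));
    }
}

/** Option 2 (hypothetical name): a Writer whose write() calls act as callbacks. */
class CallbackWriter extends Writer {
    private final Consumer<String> sink;

    CallbackWriter(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        sink.accept(new String(cbuf, off, len));
    }

    @Override
    public void flush() {
        // Nothing is buffered here.
    }

    @Override
    public void close() {
        // No resources to release.
    }
}

class TextExtractionSketch {
    /** Parses a small XML string with the JDK SAX parser and collects its text. */
    static String extractText(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new TextCallbackHandler(out::append));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText("<p>Hello <b>Tika</b></p>"));
    }
}
```

In both variants the callback runs synchronously inside the parse call, so the text arrives in chunks as the parser produces it rather than as a pull-style stream. If you need to pull the text at your own pace, ParsingReader is still the simpler route.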