Hi,

On Thu, Feb 5, 2009 at 3:02 AM, Jonathan Koren <jonat...@soe.ucsc.edu> wrote:
> What I really want is someone to tell me how to get back a usable stream of
> plaintext, whether this involves a radical change to Tika's ContentHandler
> class or some trick with Java, I really don't care, as long as it's single
> thread safe.

Have you looked at the ParsingReader class? It seems like a good
match for your needs. ParsingReader starts a background thread to do
the parsing and pipes the output to you, so you control when and how
much of the extracted text you read.
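
To illustrate the mechanism, here is a rough sketch of the
thread-plus-pipe pattern using only JDK classes. It is not Tika's actual
code: TextProducer, open() and readAll() are hypothetical names, and the
lambda in main() merely stands in for a parser pushing out text. With
Tika itself you should not need any of this, since wrapping your input
stream in a ParsingReader gives you the reading end directly.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;

public class PipedTextSketch {

    /** Hypothetical stand-in for a push-style parser that writes extracted text. */
    interface TextProducer {
        void produce(Writer out) throws IOException;
    }

    /** Runs the producer on a background thread and returns the reading end of the pipe. */
    static Reader open(TextProducer producer) throws IOException {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);
        Thread worker = new Thread(() -> {
            // Closing the writer signals end-of-stream to the reading side.
            try (Writer out = writer) {
                producer.produce(out);
            } catch (IOException e) {
                // A real implementation would report this to the reader; the sketch drops it.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return reader;
    }

    /** Pulls all text from the reader; the caller decides when and how much to read. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader in = reader) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader reader = open(out -> {
            out.write("first chunk of text\n");
            out.write("second chunk of text\n");
        });
        System.out.print(readAll(reader));
    }
}
```

The consuming code stays single-threaded from your point of view; the
extra thread exists only on the producing side of the pipe.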

Alternatively, if the extra thread is not acceptable, you could
implement a custom ContentHandler that directly catches and processes
the characters() and ignorableWhitespace() events.
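
A minimal sketch of such a handler follows. TextHandler and extract()
are hypothetical names, and to keep the example self-contained it is
driven by the JDK's own SAX parser on a small XML string; with Tika you
would pass the handler to the parser's parse() call instead. The sketch
only collects the text, whereas your real processing would go inside the
two callbacks.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextCallbackSketch {

    /** Hypothetical handler that processes text as the parser emits it. */
    static class TextHandler extends DefaultHandler {
        private final StringBuilder seen = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Process the chunk right here; this sketch just collects it.
            seen.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Treated the same way as ordinary character data.
            seen.append(ch, start, length);
        }

        String text() {
            return seen.toString();
        }
    }

    /** Parses the given XML on the calling thread and returns the text seen. */
    static String extract(String xml) throws Exception {
        TextHandler handler = new TextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<p>Hello, <b>world</b></p>"));
    }
}
```

Everything here runs on the calling thread: the parser invokes the
callbacks as it goes, so no second thread is involved.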

Or you could subclass Writer and treat the write() calls as callbacks
from the parser.
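
Roughly like this, as a sketch: CallbackWriter is a hypothetical name,
and main() calls write() by hand where the parser would normally do it.
With Tika you would wrap the writer in a content handler that forwards
text to a Writer; I believe BodyContentHandler has a constructor taking
one, but treat that as an assumption and check the javadoc.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class CallbackWriterSketch {

    /** Hypothetical Writer that treats every write() as a "text arrived" callback. */
    static class CallbackWriter extends Writer {
        final List<String> chunks = new ArrayList<>();

        @Override
        public void write(char[] cbuf, int off, int len) {
            // All other write() overloads in Writer funnel into this method.
            chunks.add(new String(cbuf, off, len));
        }

        @Override
        public void flush() {
            // Nothing is buffered, so there is nothing to flush.
        }

        @Override
        public void close() {
            // No underlying resource to release.
        }
    }

    /** Feeds the parts to a fresh CallbackWriter, as a parser would, and returns the chunks. */
    static List<String> feed(String... parts) throws IOException {
        CallbackWriter writer = new CallbackWriter();
        for (String part : parts) {
            writer.write(part);
        }
        return writer.chunks;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(feed("Hello, ", "world"));
    }
}
```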

BR,

Jukka Zitting
