On Feb 26, 2009, at 5:52 AM, Jukka Zitting wrote:

Hi,

On Thu, Feb 26, 2009 at 6:31 AM, Jonathan Koren <jonat...@soe.ucsc.edu> wrote:
Ideas?  Suggestions?

If you need special processing for tar files, then the best
alternative is probably to use the TarInputStream class directly, and
use the higher level Tika parsers only for parsing the individual tar
entries.

If you need such processing to be an integral part of Tika, then you
can wrap your custom logic into a Parser class and modify your
configuration to use that parser instead of the default TarParser for
tar files.

I was originally thinking about some way of having ParsingReader set a ContentHandler for its internal BodyContentHandler. As it's set up now, you can't get SAX events at all with a ParsingReader. Unfortunately, there doesn't really seem to be a clean or general way to do that.

Actually, if ParsingReader had some sort of mode where it spat out the XML directly instead of (indirectly) using WriteOutContentHandler to convert everything to plain text, then one could use whatever XML parser, including an XML-to-text converter, on the read side. As it is, it seems like ParsingReader is being just a little too smart.
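
To make the "spit out the XML directly" idea concrete, here is a rough sketch that uses only JDK classes, not Tika. The names XmlPipeSketch, SaxSource and xmlReader are made up for illustration; SaxSource is a stand-in for whatever parser produces the SAX events. It assumes the same general shape as ParsingReader (a producer thread writing into a pipe), but serializes the events as XML instead of flattening them to text.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

class XmlPipeSketch {

    /** Hypothetical stand-in for a parser: anything that emits SAX events. */
    interface SaxSource {
        void emit(ContentHandler handler) throws Exception;
    }

    /** Returns a Reader over the XML serialization of the events from source. */
    static Reader xmlReader(SaxSource source) throws Exception {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);

        // An identity TransformerHandler serializes incoming SAX events as XML.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(writer));

        // The producer runs in the background so the caller can read lazily.
        Thread producer = new Thread(() -> {
            try {
                source.emit(handler);
            } catch (Exception e) {
                // Sketch only: a real implementation would hand this
                // exception over to the reading side instead of dropping it.
            } finally {
                try {
                    writer.close(); // signals end of stream to the reader
                } catch (IOException ignored) {
                    // nothing useful to do here in a sketch
                }
            }
        });
        producer.setDaemon(true);
        producer.start();
        return reader;
    }

    /** Drains a Reader into a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            out.append(buf, 0, n);
        }
        reader.close();
        return out.toString();
    }

    /** Feeds a tiny hand-written event stream through the pipe. */
    static String demo() throws Exception {
        Reader reader = xmlReader(handler -> {
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] text = "hello".toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
        });
        return readAll(reader);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

On the read side the caller just sees a character stream of markup, so it could be handed to any XML parser or to an XML-to-text converter. The source has to call endDocument() for the serializer to flush reliably.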

Comments?

--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/

