I have a tar file that I want to index the contents of as separate
files. To do this, I hooked up an AutoDetectParser to a
ParsingReader. I'm using ParsingReader because the total uncompressed
contents of the tars can be quite large.
If I understand how AutoDetectParser works, it figures out that the
file is a tar, and thus fires off a TarParser which is a type of
PackageParser. The PackageParser reads the tar, and sends SAX events
to some Tika internal representation of the file. Specifically, it
sends magic DIVs delimiting the contents of each file, which in
turn are parsed by another AutoDetectParser. The complete sequence of
SAX events for the entire tar file is then passed from the outermost
AutoDetectParser to ParsingReader.
Here's where things go off the track. The output stream of
ParsingReader is *plain text*, meaning that it is now impossible to
determine where one file within the tar ends, and where the next file
begins. Poking around within ParsingReader shows that the SAX events
are being passed through a BodyContentHandler, which, when constructed
with the default constructor, writes out only the characters of the
XML stream (i.e. it performs an XML-to-text conversion).
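
To illustrate the loss of boundaries, here is a minimal sketch that uses only
the JDK's SAX classes, not Tika itself. TextOnlyDemo and TextOnlyHandler are
hypothetical names, the handler is a stand-in for the text-only behaviour
described above rather than Tika's actual BodyContentHandler, and the
"package-entry" class value on the DIVs is an assumption about the markup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextOnlyDemo {

    /** Keeps only character data and silently drops every element boundary. */
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the given XML and returns just its text content. */
    static String flatten(String xml) {
        try {
            TextOnlyHandler handler = new TextOnlyHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Two entries, each wrapped in its own DIV (assumed markup).
        String xml = "<body>"
                + "<div class=\"package-entry\"><p>first file</p></div>"
                + "<div class=\"package-entry\"><p>second file</p></div>"
                + "</body>";
        // The DIV boundaries are gone; the two texts run together.
        System.out.println(flatten(xml)); // prints "first filesecond file"
    }
}
```
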
It seems like there either needs to be a way for ParsingReader to
associate a ContentHandler with its internal BodyContentHandler, or
the default action for BodyContentHandler should be to send the XML
directly, and not convert it to plain text.
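
For what it's worth, the kind of handler I would want to plug in might look
roughly like the sketch below. It is written against the plain JDK SAX
interfaces only; EntrySplitter is a hypothetical name, and matching on a DIV
with class "package-entry" is an assumption about what the delimiting
elements look like, so adjust that to whatever the parser really emits.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntrySplitter extends DefaultHandler {
    private final List<String> entries = new ArrayList<>();
    private StringBuilder current; // text of the entry being read, or null
    private int depth;             // element nesting depth inside that entry

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if (current != null) {
            depth++; // an element nested inside the current entry
        } else if ("div".equals(qName)
                && "package-entry".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new entry begins here
            depth = 0;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) {
            return; // outside any entry
        }
        if (depth == 0) { // the entry's own DIV is closing
            entries.add(current.toString());
            current = null;
        } else {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    /** Parses the XML and returns the text of each entry separately. */
    static List<String> split(String xml) {
        try {
            EntrySplitter handler = new EntrySplitter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }
}
```

Note that this sketch buffers each entry in memory and folds nested entries
(a tar inside a tar) into their parent, so it only shows the boundary
tracking, not a streaming solution.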
Oh, and subclassing ParsingReader isn't an option without essentially
reimplementing it since the problematic BodyContentHandler is
instantiated within the private ParsingThread class.
Ideas? Suggestions?
--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/