I have a tar file whose contents I want to index as separate files. To do this, I hooked up an AutoDetectParser to a ParsingReader. I'm using ParsingReader because the total uncompressed contents of the tars can be quite large.

If I understand how AutoDetectParser works, it figures out that the file is a tar and hands it to a TarParser, which is a type of PackageParser. The PackageParser reads the tar and sends SAX events to some Tika internal representation of the file. Specifically, it sends magic DIVs delimiting the contents of each file, which are in turn parsed by another AutoDetectParser. The complete sequence of SAX events for the entire tar file is then passed from the outermost AutoDetectParser to ParsingReader.

Here's where things go off the track. The output stream of ParsingReader is *plain text*, meaning that it is now impossible to determine where one file within the tar ends and where the next file begins. Poking around within ParsingReader shows that the SAX events are being passed through a BodyContentHandler which, when constructed with the default constructor, only writes out the characters of the XML stream (i.e. it performs an XML-to-text conversion).
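
To make the problem concrete, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. The emitEntries method is a hypothetical stand-in for the events I believe the PackageParser produces; the class name FlattenDemo, the "package-entry" class attribute, and the sample text are assumptions for illustration. The same events are sent once to a characters-only handler (roughly what I think the default BodyContentHandler does) and once to an XML serializer:

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class FlattenDemo {

    // Hypothetical stand-in for the parser: one <div> per archive entry.
    // The "package-entry" class value is an assumption, not confirmed here.
    static void emitEntries(ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.startElement("", "body", "body", new AttributesImpl());
        for (String text : new String[] {"first file", "second file"}) {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "class", "class", "CDATA", "package-entry");
            handler.startElement("", "div", "div", atts);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "div", "div");
        }
        handler.endElement("", "body", "body");
        handler.endDocument();
    }

    // Characters-only handler: every element boundary is silently dropped.
    static String asText() throws SAXException {
        StringBuilder out = new StringBuilder();
        emitEntries(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        return out.toString();
    }

    // Identity transformer: serializes the same events as XML, keeping the divs.
    static String asXml() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        handler.setResult(new StreamResult(writer));
        emitEntries(handler);
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(asText());
        System.out.println(asXml());
    }
}
```

In the text form the two entries run together with nothing between them, while the XML form still carries one div per entry.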

It seems like either there needs to be a way for ParsingReader to associate a ContentHandler with its internal BodyContentHandler, or the default action for BodyContentHandler should be to send the XML through directly rather than convert it to plain text.
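
For illustration, this is roughly the kind of ContentHandler I would want to plug in if ParsingReader allowed it. It is a minimal sketch using only the JDK's SAX interfaces; the class name EntrySplitter, the "package-entry" marker value, and the demo events are hypothetical, and I have not checked them against what Tika actually emits. It collects the text of each top-level entry div separately instead of flattening everything into one stream:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class EntrySplitter extends DefaultHandler {
    private final List<String> entries = new ArrayList<>();
    private StringBuilder current; // null while outside any entry div
    private int depth;             // div nesting depth inside the current entry

    private static boolean isDiv(String localName, String qName) {
        return "div".equals(localName) || "div".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!isDiv(localName, qName)) {
            return;
        }
        if (current != null) {
            depth++; // nested div: its text is folded into the enclosing entry
        } else if ("package-entry".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // assumed marker for a new entry
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && isDiv(localName, qName) && --depth == 0) {
            entries.add(current.toString());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<String> getEntries() {
        return entries;
    }

    // Feeds hand-written events for a two-entry archive through the handler.
    static List<String> demo() throws SAXException {
        EntrySplitter splitter = new EntrySplitter();
        splitter.startDocument();
        for (String text : new String[] {"first file", "second file"}) {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "class", "class", "CDATA", "package-entry");
            splitter.startElement("", "div", "div", atts);
            splitter.characters(text.toCharArray(), 0, text.length());
            splitter.endElement("", "div", "div");
        }
        splitter.endDocument();
        return splitter.getEntries();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(demo());
    }
}
```

Note that this buffers each entry in memory, which defeats part of the reason for using ParsingReader in the first place; a real version would presumably hand each entry to the indexer as it streams by.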

Oh, and subclassing ParsingReader isn't an option without essentially reimplementing it, since the problematic BodyContentHandler is instantiated within the private ParsingThread class.

Ideas? Suggestions?

--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/