ParsingReader and PackageParser

Jonathan Koren Wed, 25 Feb 2009 21:32:23 -0800

I have a tar file whose contents I want to index as separate files. To do this, I hooked up an AutoDetectParser to a ParsingReader. I'm using ParsingReader since the total uncompressed contents of the tars can be quite large.

If I understand how AutoDetectParser works, it figures out that the file is a tar, and thus fires off a TarParser, which is a type of PackageParser. The PackageParser reads the tar and sends SAX events to some Tika internal representation of the file. Specifically, it sends magic DIVs delimiting the contents of each file, which in turn are parsed by another AutoDetectParser. The complete sequence of SAX events for the entire tar file is then passed from the outermost AutoDetectParser to ParsingReader.

Here's where things go off the track. The output stream of ParsingReader is *plain text*, meaning that it is now impossible to determine where one file within the tar ends and where the next file begins. Poking around within ParsingReader shows that the SAX events are being passed through a BodyContentHandler, which, when constructed with the default constructor, only writes out the characters of the XML stream (i.e., it performs an XML-to-text conversion).

It seems like there either needs to be a way for ParsingReader to associate a ContentHandler with its internal BodyContentHandler, or the default action for BodyContentHandler should be to send the XML directly, and not convert it to plain text.
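
To make the idea concrete, here is a rough sketch of the kind of ContentHandler I would like to be able to plug in. It uses only the JDK's SAX classes, not Tika. The class name EntrySplittingHandler, the helper split(), and the marker value "package-entry" are my own assumptions for illustration; I have not checked which attribute the magic DIVs actually carry.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that collects the text of each entry DIV separately. */
class EntrySplittingHandler extends DefaultHandler {
    // Assumption: the class attribute value that marks a per-file DIV.
    private static final String ENTRY_CLASS = "package-entry";

    private final List<String> entries = new ArrayList<>();
    private StringBuilder current; // null while outside an entry DIV
    private int depth;             // DIV nesting depth inside the current entry

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current != null) {
            depth++; // a nested DIV stays part of the enclosing entry
        } else if (ENTRY_CLASS.equals(atts.getValue("class"))) {
            current = new StringBuilder();
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "div".equals(qName) && --depth == 0) {
            entries.add(current.toString()); // the entry DIV has closed
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<String> getEntries() {
        return entries;
    }

    /** Convenience helper: runs the handler over an XML string with the JDK SAX parser. */
    public static List<String> split(String xml) {
        try {
            EntrySplittingHandler handler = new EntrySplittingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getEntries();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Calling EntrySplittingHandler.split(xml) on a stream with two entry DIVs returns a list of two strings, one per file. Note that this sketch buffers each entry in memory and folds nested archives into their enclosing entry, so it only illustrates where the boundaries are; it does not solve the large-file streaming problem that made me reach for ParsingReader in the first place.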

Oh, and subclassing ParsingReader isn't an option without essentially reimplementing it, since the problematic BodyContentHandler is instantiated within the private ParsingThread class.


Ideas?  Suggestions?

--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/
