Title: Slow SAX parsing of large CDATA?

I'm using SAX (Xerces 2.3.0 on Windows XP) to parse an XML file that can contain large CDATA sections (between 1 and 5 MB). The data is Base64-encoded. The code works correctly, but once a CDATA section exceeds about 1 MB, parsing becomes very slow: a 1 MB section is processed in several seconds, but at 3 or 4 MB the processing time climbs to about 10 minutes.

It looks as though Xerces is accumulating the entire section in one large buffer before calling my characters callback. (I'd prefer to get many characters callbacks so I can stream the data to a file rather than holding it all in memory.) Garbage collection is also very active during this time, and it doesn't seem to matter whether my max heap is set to 128 MB or 256 MB; the behavior is the same.

This is the stack trace I get while it's processing:

    at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

Since my data is Base64-encoded, I don't really need the CDATA; I could just treat it as ordinary element content. If I do this, I get a characters callback for each line of the encoded data, and it's wonderfully fast. Unfortunately, the XML files that I need to process are provided by another vendor and contain the CDATA.
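To illustrate the streaming behavior described above, here is a minimal sketch of a handler that decodes and writes each characters callback as it arrives instead of accumulating the whole payload. The class name StreamingBase64Handler, the element name "payload", and the sample document are hypothetical, and the sketch uses java.util.Base64 from a current JDK rather than anything specific to Xerces 2.3.0. It only helps when the parser delivers the data in many small callbacks; it does not change how the parser buffers a CDATA section.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: decodes Base64 text of one element incrementally.
class StreamingBase64Handler extends DefaultHandler {
    private final OutputStream out;
    private final String element;          // qualified name of the payload element (assumed)
    private final StringBuilder pending = new StringBuilder(); // holds at most 3 chars between callbacks
    private boolean inside;

    StreamingBase64Handler(OutputStream out, String element) {
        this.out = out;
        this.element = element;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(element)) {
            inside = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!inside) {
            return;
        }
        // Drop line breaks and other whitespace, which are not Base64 data.
        for (int i = start; i < start + length; i++) {
            if (!Character.isWhitespace(ch[i])) {
                pending.append(ch[i]);
            }
        }
        // Decode only complete 4-character groups; keep the remainder for the next callback.
        decodeAndWrite(pending.length() - pending.length() % 4);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equals(element)) {
            decodeAndWrite(pending.length()); // flush any trailing (possibly unpadded) group
            inside = false;
        }
    }

    private void decodeAndWrite(int count) throws SAXException {
        if (count == 0) {
            return;
        }
        try {
            out.write(Base64.getDecoder().decode(pending.substring(0, count)));
        } catch (IOException e) {
            throw new SAXException(e);
        }
        pending.delete(0, count);
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for the vendor file; the real payload would be several MB.
        String xml = "<doc><payload><![CDATA[aGVsbG8g\nd29ybGQ=]]></payload></doc>";
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new StreamingBase64Handler(sink, "payload"));
        System.out.println(sink.toString("UTF-8")); // prints "hello world"
    }
}
```

In real use the ByteArrayOutputStream would be replaced by a buffered FileOutputStream, so memory use stays small no matter how large the payload is, provided the parser does not buffer the section itself.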

Has anybody else run into this? Are there any workarounds, or any way to tell Xerces that I want more frequent characters callbacks?

Thanks,
Daniel Rabe
[EMAIL PROTECTED]
