Yonghui Chen wrote:
> I have just tried 3 parsers, Xerces Java 1, Xerce Java 2 and Crimson,
> use both DOM and SAX parser parse a 960kb XML file for 10 times, the
> time cost are:

What are you using to achieve your results? Are you only starting a VM, parsing a single document, showing the time, and exiting the VM? That is not a fair comparison and does not match real-world use of the parser.

The sax.Counter, dom.Counter, and xni.Counter samples that come with Xerces2 are very convenient and can give you a "poor man's" performance test. The xni.Counter is the one I use, and I'll explain why.

Xerces2 is designed around the new Xerces Native Interface (XNI), which allows us to more easily create new types of parsers and re-use the same code to generate DOM trees, emit SAX events, etc. The default parser configuration does everything: full-fledged scanning of XML documents, DTD validation, namespace binding, XML Schema validation, etc. Depending on your needs, however, you can play tricks with the parser configuration. For example, if you know that the documents are generated and are therefore always well-formed and valid, then you do not need to perform validation. So the validation components can be removed from the pipeline to improve performance.

Getting back to my point... The xni.Counter sample (as well as the other XNI samples) allows you to set the parser configuration by name so that you can easily test new parser configurations. There is an XNI sample included that creates a non-validating parser configuration. You can use this with the xni.Counter sample to see how much performance can be gained by not validating every document.

This is just one example of a way to achieve better performance, though. If you *need* validation, then you must find another way to improve performance. I will say a few words on this issue.

First, in some areas Xerces2 will never be as fast as Xerces 1.x. In particular, we made the decision in the Xerces2 implementation to always transcode the document (i.e. changing the bytes of the document into Java chars).
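
To make the measurement advice above concrete, here is a minimal sketch of timing repeated parses inside a single VM, with a few warm-up parses before the clock starts. It is not the xni.Counter sample and does not select an XNI parser configuration; it uses the standard JAXP API instead, and the setValidating toggle only stands in for switching between a validating and a non-validating configuration. The class name ParseTiming, its helper methods, the synthetic document, and the warm-up and run counts are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical harness: names and numbers below are illustrative only.
class ParseTiming {

    /** Counts start-element events, similar in spirit to the Counter samples. */
    static final class ElementCounter extends DefaultHandler {
        long elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Builds a synthetic document with an internal DTD, kept in memory to avoid disk I/O. */
    static byte[] buildDocument(int items) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        sb.append("<!DOCTYPE root [");
        sb.append("<!ELEMENT root (item*)>");
        sb.append("<!ELEMENT item (#PCDATA)>");
        sb.append("<!ATTLIST item id CDATA #REQUIRED>");
        sb.append("]><root>");
        for (int i = 0; i < items; i++) {
            sb.append("<item id=\"").append(i).append("\">text</item>");
        }
        sb.append("</root>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Creates a namespace-aware SAX parser, with DTD validation on or off. */
    static SAXParser newParser(boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(validating);
        return factory.newSAXParser();
    }

    /** Parses the document once and returns the number of elements seen. */
    static long parseOnce(SAXParser parser, byte[] doc) throws Exception {
        ElementCounter counter = new ElementCounter();
        parser.parse(new ByteArrayInputStream(doc), counter);
        return counter.elements;
    }

    /** Warms up, then times `runs` parses in the same VM; returns average ms per parse. */
    static double averageMillis(boolean validating, byte[] doc, int warmup, int runs)
            throws Exception {
        SAXParser parser = newParser(validating);
        for (int i = 0; i < warmup; i++) {
            parseOnce(parser, doc); // untimed: lets class loading and JIT settle
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            parseOnce(parser, doc);
        }
        return (System.nanoTime() - start) / 1e6 / runs;
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = buildDocument(20000);
        System.out.printf("non-validating: %.2f ms/parse%n", averageMillis(false, doc, 10, 20));
        System.out.printf("validating:     %.2f ms/parse%n", averageMillis(true, doc, 10, 20));
    }
}
```

The numbers this prints depend heavily on the VM, the parser implementation on the classpath, and the document, so treat them as a rough relative comparison rather than a benchmark result.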

The old parser would defer this transcoding until needed, but this created a situation where we had duplicated code, which introduced the possibility of more bugs. Deferring the conversion of the underlying bytes was also an issue in terms of memory usage.

Also, Xerces2 has much better support for the various standards and other features than its predecessor. You can't do more work in less time, so this is one reason why Xerces2 may initially appear slower. However, we believe that the inherent modularity of the system is better in the long run for continued maintenance and extension of the parser to add new features in the future.

Lastly, we have not done serious performance tuning on the new Xerces2 codebase, so this is one area in particular that we can definitely improve in subsequent releases.

We want to make the parser faster and better, but the standard parser configuration may not match Xerces 1.x for larger documents. Xerces 1.x was heavily optimized but not very flexible, so we are accepting a slight performance hit in certain areas. But please hang in there -- it will get better! :)

-- 
Andy Clark * [EMAIL PROTECTED]
