Re: About performance

Andy Clark 11 Mar 2002 13:03:21 -0000

Yonghui Chen wrote:
> I have just tried 3 parsers, Xerces Java 1, Xerce Java 2 and Crimson,
> use both DOM and SAX parser parse a 960kb XML file for 10 times, the
> time cost are:


How are you measuring these results? Are you only
starting a VM, parsing a single document, showing the time,
and exiting the VM? That is not a fair comparison, because
VM startup and class loading are included in the timing,
and it does not match real-world use of the parser.
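
To make the point about measurement concrete, here is a
rough sketch of a fairer harness: one VM, a few untimed
warm-up parses, then several timed parses of the same
document. It is not one of the Xerces samples. The class
name ParseTimer and the method timeParses are made up for
illustration, and it goes through the generic JAXP API, so
it times whichever parser JAXP selects on your classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Xerces or its samples.
public class ParseTimer {

    // Parses the document (warmup + runs) times inside one VM and
    // returns the elapsed nanoseconds of the timed runs only.
    public static long[] timeParses(byte[] xml, int warmup, int runs)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        DefaultHandler handler = new DefaultHandler(); // ignores all events

        // Untimed passes so class loading and JIT warm-up are not measured.
        for (int i = 0; i < warmup; i++) {
            parser.parse(new ByteArrayInputStream(xml), handler);
        }

        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            parser.parse(new ByteArrayInputStream(xml), handler);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) throws Exception {
        // The document is read once so disk I/O is not part of the timing.
        byte[] xml = Files.readAllBytes(Paths.get(args[0]));
        for (long n : timeParses(xml, 5, 10)) {
            System.out.println((n / 1_000_000.0) + " ms");
        }
    }
}
```

Run it with the path of your test document as the only
argument. The warm-up and run counts (5 and 10) are
arbitrary choices.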

The sax.Counter, dom.Counter, and xni.Counter samples that
come with Xerces2 are very convenient and can provide you
a "poor man's" performance test. The xni.Counter is the
one I use and I'll explain why.

Xerces2 is designed around the new Xerces Native Interface
(XNI) which allows us to more easily create new types of
parsers and re-use the same code to generate DOM trees,
emit SAX events, etc. The default parser configuration does
everything: full-fledged scanning of XML documents, DTD
validation, namespace binding, XML Schema validation, etc.

Depending on your needs, however, you can play tricks with
the parser configuration. For example, if you know that the
documents are generated and therefore are always well-formed
and valid, then you do not need to perform validation. So
the validation components can be removed from the pipeline
to improve performance.
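
As a rough illustration of parsing without validation, the
sketch below uses the standard JAXP switch rather than an
XNI parser configuration, so it is a stand-in for the idea
and not the pipeline change itself. Whether the parser
actually drops validation components when the switch is off
is an implementation detail. The class name
ValidationToggle and the method countValidityErrors are
invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Xerces or its samples.
public class ValidationToggle {

    // Parses the document and returns how many validity errors
    // (recoverable errors) the parser reported.
    public static int countValidityErrors(String xml, boolean validate)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate); // standard JAXP DTD validation switch
        XMLReader reader = factory.newSAXParser().getXMLReader();

        final int[] errors = {0};
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++; // validity errors arrive here
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        // Well-formed but invalid: <a> is declared EMPTY yet has a child.
        String doc = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a><b/></a>";
        System.out.println("validating:     " + countValidityErrors(doc, true));
        System.out.println("non-validating: " + countValidityErrors(doc, false));
    }
}
```

With validation off the invalid document parses without
complaint, which is only safe if you really do know the
documents are valid.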

Getting back to my point...

The xni.Counter sample (as well as the other XNI samples)
allows you to set the parser configuration by name so that
you can easily test new parser configurations. There is
an XNI sample included that creates a non-validating
parser configuration. You can use this with the xni.Counter
sample to see how much performance can be gained by not
validating every document.

This is just one way to get better performance. If you
*need* validation, then you must find another way to
improve performance, and I will say a few words on that
issue.

First, in some areas Xerces2 will never be as fast as
Xerces 1.x. In particular, we made the decision in the
Xerces2 implementation to always transcode the document
(i.e. changing the bytes of the document into Java chars).
The old parser deferred this work until needed, but
this led to duplicated code, which introduced the
possibility of more bugs. Deferring the conversion of
the underlying bytes was also an issue in terms of
memory usage.
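
For anyone unfamiliar with the term, transcoding here just
means decoding the document's bytes into Java chars. The
sketch below shows that step in isolation using the JDK's
charset API. It is not Xerces code, and the class name
EagerTranscode and the method transcode are made up for
illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces.
public class EagerTranscode {

    // Decodes the whole byte array into chars up front.
    public static char[] transcode(byte[] bytes, Charset charset) {
        CharBuffer decoded = charset.decode(ByteBuffer.wrap(bytes));
        char[] chars = new char[decoded.remaining()];
        decoded.get(chars);
        return chars;
    }

    public static void main(String[] args) {
        // \u00e9 takes two bytes in UTF-8 but is a single Java char.
        byte[] bytes = "<a>\u00e9</a>".getBytes(StandardCharsets.UTF_8);
        char[] chars = transcode(bytes, StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, " + chars.length + " chars");
    }
}
```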

Also, Xerces2 has much better support for the various
standards and other features than its predecessor. You
can't do more work in less time so this is one reason
why Xerces2 may initially appear slower. However, we
believe that the inherent modularity of the system is
better in the long run for continued maintenance and
extension of the parser to add new features in the
future.

Lastly, we have not done serious performance tuning on
the new Xerces2 codebase, so we know that this is an
area that we can definitely improve in
subsequent releases. We want to make the parser faster
and better but the standard parser configuration may
not match Xerces 1.x for larger documents. Xerces 1.x
was heavily optimized but not very flexible so we are
accepting a slight performance hit in certain areas.

But please hang in there -- it will get better! :)

-- 
Andy Clark * [EMAIL PROTECTED]
