Hi, most recent parsers (including Xerces) do something like that, basically by looking up an interned String object in a cache keyed by the input characters. I have tested its influence on performance, and typically it makes a 3-10% difference (depending on how much of the input consists of tags with repeated element or attribute names). You can see such results for XPP3: look for the "Mostly Tags" and "Mostly Text" documents and the XPP3 parser configurations "MXP1 beta1 w/NS" and "MXP1 beta1 w/NS no-string-caching" at http://www.extreme.indiana.edu/~aslom/xpp_sax2bench/results.html
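The caching described above can be sketched roughly like this. This is a toy illustration, not Xerces' actual symbol-table implementation (the class and method names here are made up): the idea is to hash the raw character range of a name and return a previously created String on a hit, so a new String object is allocated only the first time a name is seen. Keeping one cache instance alive across parse runs also gives the cross-run reuse discussed below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SymbolCache {
    // Buckets keyed by the hash of the character range.
    private final Map<Integer, List<String>> buckets = new HashMap<>();

    // Return a shared String for the characters in buf[offset..offset+len).
    // On a cache hit no String is allocated at all; the old object is reused.
    public String addSymbol(char[] buf, int offset, int len) {
        int hash = 0;
        for (int i = 0; i < len; i++) {
            hash = 31 * hash + buf[offset + i];
        }
        List<String> bucket = buckets.computeIfAbsent(hash, k -> new ArrayList<>());
        for (String s : bucket) {
            if (s.length() == len && regionMatches(s, buf, offset)) {
                return s;                       // hit: same object every time
            }
        }
        String symbol = new String(buf, offset, len);  // miss: allocate once
        bucket.add(symbol);
        return symbol;
    }

    private static boolean regionMatches(String s, char[] buf, int offset) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SymbolCache symbols = new SymbolCache();
        char[] input = "<item><item>".toCharArray();
        String first = symbols.addSymbol(input, 1, 4);   // "item"
        String second = symbols.addSymbol(input, 7, 4);  // "item" again
        System.out.println(first == second);             // prints true: one shared object
    }
}
```

Note that the lookup hashes the char[] range directly, which matches the point below that the identification of repeating names must happen before a String object is created.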
thanks,

alek

"KUMAR,PANKAJ (HP-Cupertino,ex1)" wrote:

> Hi,
>
> A few months ago I wrote a program to measure Java XML parsing
> performance. Maybe it could be of some use here. You can find details at
> http://www.pankaj-k.net/xpb4j/
>
> I am not aware of Xerces internals, so whatever I say here may not make much
> sense, but one area where I feel that optimization at the parser level can
> improve performance in server-based applications is the use of the same String
> objects across parse runs. Let me elaborate: a server program that accepts
> XML documents with every request sees instances of documents from a
> small set of schemas. These documents use the same element names,
> attributes, and namespace URIs. If the same immutable String objects can be
> reused for these, there could be significant savings in allocation and
> deallocation.
>
> The problem is slightly complicated because the identification of repeating
> Strings must happen at a much lower level, before a String object is created
> for a lookup. What could do the job is perhaps some smart lookup during
> lexical analysis.
>
> Regards,
> Pankaj Kumar
> Web Services Architect
> HP Middleware
>
> -----Original Message-----
> From: Gopal Sharma
> To: [email protected]; [EMAIL PROTECTED]
> Sent: 5/5/02 7:18 AM
> Subject: [Xerces2] Measuring performance and optimization
>
> FYI
>
> Hi,
>
> I have forwarded this mail to _YOU_ (general and xerces-j-user) given
> that you might be using *Xerces 2* in one way or another and could provide
> some data/details/suggestions/comments which would help us in this effort.
>
> Thanks in advance for your valuable suggestions and comments.
> - Gopal
>
> ------------- Begin Forwarded Message -------------
>
> Date: Fri, 3 May 2002 21:03:00 +0000 (Asia/Calcutta)
> From: Rahul Srivastava <[EMAIL PROTECTED]>
> Subject: [xerces2] Measuring performance and optimization
> To: [email protected]
>
> Hi folks,
>
> There has long been talk about improving the performance of Xerces2. Some
> benchmarking has been done earlier, for instance the work by Dennis
> Sosnoski; see http://www.sosnoski.com/opensrc/xmlbench/index.html. These
> results are important for knowing how fast or slow Xerces is compared to
> other parsers, but we need to identify areas of improvement in Xerces. We
> need to measure the time taken by each individual component in the pipeline,
> figure out which component swallows how much time for various events, and
> then concentrate on improving performance in those areas. So here is what
> we plan to do:
>
> + SAX parsing
>   - time taken
> + DOM parsing
>   - DOM construction time
>   - DOM traversal time
>   - memory consumed
>   - considering the feature deferred-dom as true/false for all of the above
> + DTD validation
>   - one-time parse: time taken
>   - multiple parses using the same instance: time taken for the second
>     parse onwards
> + Schema validation
>   - one-time parse: time taken
>   - multiple parses using the same instance: time taken for the second
>     parse onwards
> + optimizing the pipeline
>   - calculate pipeline/component initialization time
>   - calculate the time each component in the pipeline takes to propagate
>     an event
>   - use configurations to set up an optimized pipeline for various cases,
>     such as no validation, DTD validation only, etc., and measure the
>     time taken
>
> Apart from this, should we consider the existing grammar caching framework
> when evaluating the performance of the parser?
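The measurement plan above can be sketched with plain JAXP. A rough harness (the document content and names here are placeholders, not the actual test inputs) that times a SAX parse, a repeated parse on the same parser instance to expose one-time setup cost, and DOM construction:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseTimer {
    // Placeholder document; real runs would use the tag/content-centric inputs.
    static final String DOC = "<root><item>text</item><item>text</item></root>";

    public static InputStream doc() {
        return new ByteArrayInputStream(DOC.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        SAXParser sax = SAXParserFactory.newInstance().newSAXParser();
        DocumentBuilder dom =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        long t0 = System.nanoTime();
        sax.parse(doc(), new DefaultHandler());   // first parse: includes setup cost
        long saxFirst = System.nanoTime() - t0;

        t0 = System.nanoTime();
        sax.parse(doc(), new DefaultHandler());   // second parse, same instance
        long saxSecond = System.nanoTime() - t0;

        t0 = System.nanoTime();
        dom.parse(doc());                         // DOM construction time
        long domBuild = System.nanoTime() - t0;

        System.out.printf("SAX first=%dus, SAX second=%dus, DOM build=%dus%n",
                saxFirst / 1000, saxSecond / 1000, domBuild / 1000);
    }
}
```

For trustworthy numbers a real harness would also warm up the JVM and average over many parses, since a single nanoTime delta on a small document mostly measures JIT and startup noise.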
> We have classified the inputs to be used for this testing as follows:
>
> + instance docs used
>   - tag centric (more tags and small content, say 10-50 bytes)
>
>     Type      Tags
>     ------------------
>     small     5-50
>     medium    50-500
>     large     >500
>
>   - content centric (few tags, say 5-10, and huge content)
>
>     Type      content between a pair of tags
>     ----------------------------------------
>     small     500 kb
>     medium    500-5000 kb
>     large     >5000 kb
>
> We can also use the depth of the tags as a criterion for the above cases.
>
> Actually, there can be enormous combinations and different figures in the
> above table that reflect real-world instance docs. I would like to know
> the view of the community here. Is this data enough to evaluate the
> performance of the parser? Is there any data which is publicly available
> and can be used for performance evaluation?
>
> + DTDs used
>   - should use different types of entities
>
> + XML Schemas used
>   - should use most of the elements and datatypes
>
> Will it really help in any way?
>
> Any comments or suggestions appreciated.
>
> Thanks,
> Rahul.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
> ------------- End Forwarded Message -------------
>
> ---------------------------------------------------------------------
> In case of troubles, e-mail: [EMAIL PROTECTED]
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
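As a footnote to the input classification above: the two document classes can be generated programmatically, which makes the benchmark runs repeatable. A rough sketch (the generator and its names are made up for illustration; sizes map loosely onto the small/medium/large buckets):

```java
public class TestDocGenerator {
    // Tag-centric: many elements, each holding a short payload (~10 bytes).
    public static String tagCentric(int nTags) {
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < nTags; i++) {
            sb.append("<e").append(i).append(">xxxxxxxxxx</e").append(i).append(">");
        }
        return sb.append("</root>").toString();
    }

    // Content-centric: few elements, each wrapping a large run of text.
    public static String contentCentric(int nTags, int contentChars) {
        StringBuilder sb = new StringBuilder("<root>");
        String filler = "x".repeat(contentChars);
        for (int i = 0; i < nTags; i++) {
            sb.append("<e").append(i).append(">").append(filler)
              .append("</e").append(i).append(">");
        }
        return sb.append("</root>").toString();
    }

    public static void main(String[] args) {
        // A "small" tag-centric doc and a "small" content-centric doc:
        System.out.println(tagCentric(50).length());
        System.out.println(contentCentric(5, 500_000).length());
    }
}
```

Varying nesting depth, as the plan suggests, would need a third generator that nests elements instead of appending siblings.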
