On Mon, 2002-05-06 at 04:19, Rahul Srivastava wrote:
>
> > Theodore W. Leung wrote...
> >
> > Tuning Xerces is going to be an iterative process. We need some test
> > data that everyone can use, and we need a test driver that everyone
> > can use.
>
> I think that is going to be really useful. Every time we add some huge
> piece of code, we can actually see how the performance is affected. I
> will also write the test driver so that everyone can use it.
>
> > I'm fine with the metrics and characterization of test data that you
> > are proposing in your message. I think it's a great start.
> >
> > I'd also like to propose that all the people working on this check
> > the test data and the test classes into the build, so that anyone can
> > run the performance timings for themselves. (I'd like to see this for
> > the full test suite as well, but that's another message).
>
> Right! I agree w/ you.
>
> > I have some time that I can contribute towards this effort.
>
> You are always welcome Ted. :-)
>
Okay, then can we agree on some stuff so that we can all work together?
What follows is a proposal to get some discussion going:

1. Let's put the performance stuff in tests/performance.

2. Let's put the data in directories under tests/performance/data.

3. Let each set of test data, as described in your classification, be in
   its own directory. How does the test driver find the data? Is it via
   a file in the directory, or will the suite walk the directory? How
   should we name the data sets?

4. Please make the driver a JUnit test so we can run it from ant -- this
   of course means adding an ant target. (There are a couple of rough
   sketches of what I have in mind at the bottom of this message.)

5. I'll start on the test data, beginning with the instance docs/tag
   centric data.

> Thanks,
> Rahul.
>
> > On Fri, 2002-05-03 at 14:03, Rahul Srivastava wrote:
> > >
> > > Hi folks,
> > >
> > > There has long been talk about improving the performance of Xerces2.
> > > Some benchmarking has been done earlier, for instance the benchmarks
> > > by Dennis Sosnoski, see:
> > > http://www.sosnoski.com/opensrc/xmlbench/index.html . These results
> > > are important for knowing how fast/slow Xerces is compared to other
> > > parsers. But we also need to identify areas of improvement in Xerces.
> > > We need to measure the time taken by each individual component in the
> > > pipeline, figure out which component swallows how much time for
> > > various events, and then concentrate on improving performance in
> > > those areas. So, here is what we plan to do:
> > >
> > > + sax parsing
> > >   - time taken
> > > + dom parsing
> > >   - dom construction time
> > >   - dom traversal time
> > >   - memory consumed
> > >   - considering the feature deferred-dom as true/false for all of the above
> > > + DTD validation
> > >   - one-time parse, time taken
> > >   - multiple parses using the same instance, time taken for the second parse onwards
> > > + Schema validation
> > >   - one-time parse, time taken
> > >   - multiple parses using the same instance, time taken for the second parse onwards
> > > + optimising the pipeline
> > >   - calculate pipeline/component initialization time
> > >   - calculate the time each component in the pipeline takes to propagate the event
> > >   - use configurations to set up an optimised pipeline for various cases,
> > >     such as no validation, DTD validation only, etc., and calculate the time taken
> > >
> > > Apart from this, should we also consider the existing grammar caching
> > > framework when evaluating the performance of the parser?
> > >
> > > We have classified the inputs to be used for this testing as follows:
> > >
> > > + instance docs used
> > >   - tag centric (more tags and small content, say 10-50 bytes)
> > >
> > >       Type      Tags#
> > >       -----------------
> > >       * small   5-50
> > >       * medium  50-500
> > >       * large   >500
> > >
> > >   - content centric (fewer tags, say 5-10, and huge content)
> > >
> > >       Type      content between a pair of tags
> > >       ----------------------------------------
> > >       * small   <500 kb
> > >       * medium  500-5000 kb
> > >       * large   >5000 kb
> > >
> > > We can also have the depth of the tags as a criterion for the above
> > > cases.
> > >
> > > Actually, there can be an enormous number of combinations and
> > > different figures in the above table that would reflect the
> > > real-world instance docs used. I would like to know the view of the
> > > community here. Is this data enough to evaluate the performance of
> > > the parser? Is there any publicly available data that can be used
> > > for performance evaluation?
> > >
> > > + DTDs used
> > >   - should use different types of entities
> > >
> > > + XML Schemas used
> > >   - should use most of the elements and datatypes
> > >
> > > Will it really help in any way?
> > >
> > > Any comments or suggestions appreciated.
> > >
> > > Thanks,
> > > Rahul.
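
To make item 4 above concrete, here is a rough, untested sketch of the kind
of JUnit driver I have in mind. The class name, the plain wall-clock timing,
and the assumption that the suite simply walks tests/performance/data are
placeholders for discussion, not a final design:

import java.io.File;

import junit.framework.TestCase;

import org.apache.xerces.parsers.SAXParser;

/**
 * Sketch only: times a plain SAX parse of every document found under
 * the proposed tests/performance/data layout.
 */
public class SAXParseTimeTest extends TestCase {

    // Proposed location of the test documents (items 1-3 above).
    private static final File DATA_DIR = new File("tests/performance/data");

    public SAXParseTimeTest(String name) {
        super(name);
    }

    public void testSAXParseTime() throws Exception {
        File[] dataSets = DATA_DIR.listFiles();
        assertNotNull("no performance data checked in yet?", dataSets);

        for (int i = 0; i < dataSets.length; i++) {
            if (!dataSets[i].isDirectory()) {
                continue;
            }
            File[] docs = dataSets[i].listFiles();
            for (int j = 0; j < docs.length; j++) {
                if (!docs[j].isFile()) {
                    continue;
                }
                SAXParser parser = new SAXParser();
                long start = System.currentTimeMillis();
                parser.parse(docs[j].toURL().toString());
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(dataSets[i].getName() + "/"
                        + docs[j].getName() + ": " + elapsed + " ms");
            }
        }
    }
}

Hooking it into the build should then just be a matter of adding an ant
target that runs it through the <junit> task.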
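
And for the DOM numbers in Rahul's list, something along these lines could
compare construction time with the deferred-node-expansion feature switched
on and off. Again only a sketch: the feature URI is the standard Xerces one,
everything else (class name, single-run timing, no traversal or memory
measurement) is made up for illustration:

import org.apache.xerces.parsers.DOMParser;

import org.w3c.dom.Document;

/**
 * Sketch only: builds a DOM twice, once with deferred node expansion
 * and once without, and reports the wall-clock time for each build.
 */
public class DOMBuildTimer {

    private static final String DEFER_FEATURE =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    static long timeBuild(String systemId, boolean deferred) throws Exception {
        DOMParser parser = new DOMParser();
        parser.setFeature(DEFER_FEATURE, deferred);

        long start = System.currentTimeMillis();
        parser.parse(systemId);
        Document doc = parser.getDocument();
        long elapsed = System.currentTimeMillis() - start;

        // A full traversal of 'doc' would go here, to capture the cost
        // that deferred expansion moves from construction to traversal.
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("deferred:     " + timeBuild(args[0], true) + " ms");
        System.out.println("non-deferred: " + timeBuild(args[0], false) + " ms");
    }
}

Memory consumption and the traversal timing from the list above would need a
bit more machinery, but the same shape should work.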
