Theodore W. Leung wrote:

> Okay, then can we agree on some stuff so that we can all work together?
> What follows is a proposal to get some discussion going:
>
> 1. Let's put the performance stuff in tests/performance
> 2. Let's put the data in directories under tests/performance/data

Looks fine.

> 3. Let each set of test data as described in your classification be in
>    its own directory.

Do you mean something like:

    tests/performance/data/tagcentric/
    tests/performance/data/contentcentric/

> How does the test driver find the data? Is it via a file in the
> directory or will the suite walk the directory?

If we agree on the above directory structure, we can pass
tests/performance/data/ as a parameter to the test driver. The driver can
then walk through tagcentric/*.xml and contentcentric/*.xml and generate a
performance report, along the lines of the sketch below.

What do others think?
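Just to make the idea concrete, here is a rough sketch of the kind of
driver I have in mind. The class name, the output format and the plain
SAX-only timing are only illustrative; nothing here is decided:

    import java.io.File;
    import java.io.FilenameFilter;

    import org.apache.xerces.parsers.SAXParser;

    // Rough, illustrative sketch only -- not a committed design.
    public class PerfDriver {

        public static void main(String[] args) throws Exception {
            // args[0] is the data root, e.g. tests/performance/data
            File root = new File(args[0]);
            String[] groups = { "tagcentric", "contentcentric" };
            for (int i = 0; i < groups.length; i++) {
                File dir = new File(root, groups[i]);
                // pick up every *.xml file in the group directory
                File[] docs = dir.listFiles(new FilenameFilter() {
                    public boolean accept(File d, String name) {
                        return name.endsWith(".xml");
                    }
                });
                if (docs == null) {
                    continue;
                }
                for (int j = 0; j < docs.length; j++) {
                    SAXParser parser = new SAXParser();
                    long start = System.currentTimeMillis();
                    parser.parse(docs[j].toURL().toString());
                    long elapsed = System.currentTimeMillis() - start;
                    System.out.println(groups[i] + "/" + docs[j].getName()
                                       + ": " + elapsed + " ms");
                }
            }
        }
    }

This only times a plain SAX parse; the DOM, DTD-validation and schema
measurements listed in my earlier mail could be added as further loops
over the same data directories.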
> How should we name the data sets?

I am not very good at naming, but I can suggest something like
TC_Small.xml and CC_Medium.xml so that others can speak up. :-)

> 4. Please make the driver a JUnit test so we can run it from ant -- this
>    means of course adding an ant target.

Actually, I don't have much idea about JUnit. Do we really need to make
the driver a JUnit test? We can have a simple driver which can be invoked
by ant.
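If people do want JUnit, I suppose the plain driver could simply be
wrapped in a small TestCase so that ant's <junit> task can run it --
something like the following (untested; the class and property names are
only examples):

    import junit.framework.TestCase;

    public class PerformanceTest extends TestCase {

        public PerformanceTest(String name) {
            super(name);
        }

        // Delegates to the plain command-line driver, so the same code
        // can be run either from ant/JUnit or directly from the shell.
        public void testPerformance() throws Exception {
            String dataDir = System.getProperty("perf.data.dir",
                                                "tests/performance/data");
            PerfDriver.main(new String[] { dataDir });
        }
    }

The ant target would then only need to set perf.data.dir and run this
class through the <junit> task.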
Thanks,
Rahul.

> > Thanks,
> > Rahul.
> >
> > > On Fri, 2002-05-03 at 14:03, Rahul Srivastava wrote:
> > > >
> > > > Hi folks,
> > > >
> > > > We have been talking about improving the performance of Xerces2
> > > > for a long time. Some benchmarking has been done earlier, for
> > > > instance the one by Dennis Sosnoski, see:
> > > > http://www.sosnoski.com/opensrc/xmlbench/index.html . Those
> > > > results are important to know how fast or slow Xerces is compared
> > > > to other parsers, but we also need to identify areas of
> > > > improvement in Xerces. We need to measure the time taken by each
> > > > individual component in the pipeline, figure out which component
> > > > swallows how much time for various events, and then concentrate
> > > > on improving performance in those areas. So, here is what we plan
> > > > to do:
> > > >
> > > > + sax parsing
> > > >   - time taken
> > > > + dom parsing
> > > >   - dom construction time
> > > >   - dom traversal time
> > > >   - memory consumed
> > > >   - considering the feature deferred-dom as true/false for all of
> > > >     the above
> > > > + DTD validation
> > > >   - one-time parse, time taken
> > > >   - multiple parses using the same instance, time taken from the
> > > >     second parse onwards
> > > > + Schema validation
> > > >   - one-time parse, time taken
> > > >   - multiple parses using the same instance, time taken from the
> > > >     second parse onwards
> > > > + optimising the pipeline
> > > >   - calculate pipeline/component initialization time
> > > >   - calculate the time each component in the pipeline takes to
> > > >     propagate an event
> > > >   - use configurations to set up an optimised pipeline for various
> > > >     cases such as no validation, DTD validation only, etc., and
> > > >     calculate the time taken
> > > >
> > > > Apart from this, should we consider the existing grammar caching
> > > > framework to evaluate the performance of the parser?
> > > >
> > > > We have classified the inputs to be used for this testing as
> > > > follows:
> > > >
> > > > + instance docs used
> > > >   - tag centric (more tags and small content, say 10-50 bytes)
> > > >
> > > >       Type      Tags
> > > >       ------------------
> > > >       small     5-50
> > > >       medium    50-500
> > > >       large     >500
> > > >
> > > >   - content centric (fewer tags, say 5-10, and huge content)
> > > >
> > > >       Type      Content between a pair of tags
> > > >       ----------------------------------------
> > > >       small     up to 500 kb
> > > >       medium    500-5000 kb
> > > >       large     >5000 kb
> > > >
> > > > We can also have the depth of the tags as a criterion for the
> > > > above cases.
> > > >
> > > > Actually speaking, there can be an enormous number of combinations
> > > > and different figures in the above tables that reflect real-world
> > > > instance docs. I would like to know the view of the community
> > > > here. Is this data enough to evaluate the performance of the
> > > > parser? Is there any data which is publicly available and can be
> > > > used for performance evaluation?
> > > >
> > > > + DTDs used
> > > >   - should use different types of entities
> > > >
> > > > + XML Schemas used
> > > >   - should use most of the elements and datatypes
> > > >
> > > > Will it really help in any way?
> > > >
> > > > Any comments or suggestions appreciated.
> > > >
> > > > Thanks,
> > > > Rahul.

Sun Microsystems, Inc.
