On Mon, 2002-05-06 at 04:19, Rahul Srivastava wrote:
> Theodore W. Leung wrote:
> > 
> > Tuning Xerces is going to be an iterative process.  We need some test
> > data that everyone can use, and we need a test driver that everyone can
> > use.
> 
> I think that is going to be really useful. Every time we add some big piece of
> code, we can actually see how the performance is affected. I will also write
> the test driver so that everyone can use it.
> 
> > 
> > I'm fine with the metrics and characterization of test data that you are
> > proposing in your message.  I think it's a great start.
> > 
> > I'd also like to propose that all the people working on this check the
> > test data and the test classes into the build, so that anyone can run
> > the performance timings for themselves.  (I'd like to see this for the
> > full test suite as well, but that's another message).  
> 
> Right! I agree with you.
> 
> > 
> > I have some time that I can contribute towards this effort.  
> 
> You are always welcome Ted. :-)
> 

Okay, then can we agree on some stuff so that we can all work together? 
What follows is a proposal to get some discussion going:

1. Let's put the performance stuff in tests/performance
2. Let's put the data in directories under tests/performance/data
3. Let each set of test data, as described in your classification, be in
its own directory.  How does the test driver find the data?  Is it via
a file in the directory, or will the suite walk the directory?  How
should we name the data sets?
4. Please make the driver a JUnit test so we can run it from ant -- this
of course means adding an ant target as well (a rough sketch of such a
test follows this list)
5. I'll start on test data, beginning with the instance docs/tag-centric
data
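
To make #4 concrete, here is a minimal sketch of what such a JUnit (3.x)
driver might look like. Everything in it is an assumption on my part: the
PerformanceTest class name, the tests/performance/data layout from #2, and
the idea of walking the data-set directories; the real driver may well look
different.

// Minimal sketch only: class name, data layout and the directory walk are
// assumptions, not the actual driver.
import java.io.File;
import junit.framework.TestCase;
import org.apache.xerces.parsers.SAXParser;

public class PerformanceTest extends TestCase {

    private static final String DATA_DIR = "tests/performance/data";

    public PerformanceTest(String name) {
        super(name);
    }

    // Walk each data-set directory and time a plain SAX parse of every doc.
    public void testSAXParseTime() throws Exception {
        File[] dataSets = new File(DATA_DIR).listFiles();
        for (int i = 0; dataSets != null && i < dataSets.length; i++) {
            File[] docs = dataSets[i].listFiles();
            for (int j = 0; docs != null && j < docs.length; j++) {
                if (!docs[j].isFile()) continue;
                SAXParser parser = new SAXParser();
                long start = System.currentTimeMillis();
                parser.parse(docs[j].toURL().toString());
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(dataSets[i].getName() + "/"
                        + docs[j].getName() + ": " + elapsed + " ms");
            }
        }
    }
}

The ant target would then just run this class through the <junit> task.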

> Thanks,
> Rahul.
> 
> 
> > 
> > On Fri, 2002-05-03 at 14:03, Rahul Srivastava wrote:
> > > 
> > > Hi folks,
> > > 
> > > There has been talk for a long time about improving the performance of Xerces2.
> > > Some benchmarking has been done earlier, for instance the work by Dennis
> > > Sosnoski, see: http://www.sosnoski.com/opensrc/xmlbench/index.html . Those
> > > results are useful for knowing how fast or slow Xerces is compared to other
> > > parsers, but we also need to identify areas of improvement within Xerces itself.
> > > We need to measure the time taken by each individual component in the pipeline,
> > > figure out which component consumes how much time for various events, and then
> > > concentrate on improving performance in those areas. So, here is what we plan
> > > to do (a rough timing sketch follows the list):
> > > 
> > > + sax parsing
> > >   - time taken
> > > + dom parsing
> > >   - dom construction time
> > >   - dom traversal time
> > >   - memory consumed
> > >   - considering the feature deferred-dom as true/false for all of the above
> > > + DTD validation
> > >   - single parse: time taken
> > >   - repeated parses with the same parser instance: time taken from the
> > >     second parse onwards
> > > + Schema validation
> > >   - single parse: time taken
> > >   - repeated parses with the same parser instance: time taken from the
> > >     second parse onwards
> > > + optimising the pipeline
> > >   - calculate pipeline/component initialization time
> > >   - calculate the time each component in the pipeline takes to propagate
> > >     the event
> > >   - use configurations to set up an optimised pipeline for various cases
> > >     such as no validation, DTD validation only, etc., and calculate the
> > >     time taken
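
For what it's worth, here is the rough shape of the per-document measurement
the SAX/DOM items above imply. It is only a sketch (the ParseTiming class
name and the command-line argument are mine), and the Runtime-based memory
figure is approximate at best.

// Rough, illustrative sketch: time a SAX parse, then a DOM build with
// deferred node expansion switched on and off, sampling memory via Runtime.
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.parsers.SAXParser;

public class ParseTiming {

    static final String DEFERRED =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    public static void main(String[] args) throws Exception {
        String doc = args[0];   // path/systemId of the instance document

        // SAX parsing: time taken.
        SAXParser sax = new SAXParser();
        long t0 = System.currentTimeMillis();
        sax.parse(doc);
        System.out.println("SAX parse: "
                + (System.currentTimeMillis() - t0) + " ms");

        // DOM construction time, with deferred-dom true and then false.
        boolean[] deferred = { true, false };
        for (int i = 0; i < deferred.length; i++) {
            DOMParser dom = new DOMParser();
            dom.setFeature(DEFERRED, deferred[i]);
            Runtime rt = Runtime.getRuntime();
            rt.gc();
            long memBefore = rt.totalMemory() - rt.freeMemory();
            t0 = System.currentTimeMillis();
            dom.parse(doc);
            long elapsed = System.currentTimeMillis() - t0;
            long memAfter = rt.totalMemory() - rt.freeMemory();
            System.out.println("DOM build (deferred=" + deferred[i] + "): "
                    + elapsed + " ms, ~" + (memAfter - memBefore) + " bytes");
        }
    }
}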
> > > 
> > > Apart from this, should we also use the existing grammar caching framework
> > > when evaluating the performance of the parser?
> > > 
> > > We have classified the inputs to be used for this testing as follows:
> > > 
> > > + instance docs used
> > >   - tag centric (many tags and small content, say 10-50 bytes per element)
> > >       Type      Tags#
> > >     -------------------
> > >     * small     5-50   
> > >     * medium    50-500
> > >     * large     >500  
> > >     
> > >   - content centric (few tags, say 5-10, and huge content)
> > >       Type      content between a pair of tags
> > >     ------------------------------------------
> > >     * small     <500 kb
> > >     * medium    500-5000 kb
> > >     * large     >5000 kb
> > > 
> > > We can also use tag nesting depth as a criterion for the above cases.
> > > 
> > > Actually, there can be an enormous number of combinations, and different figures
> > > in the above tables, that would reflect real-world instance docs. I would like to
> > > know the view of the community here. Is this data enough to evaluate the
> > > performance of the parser? Is there any publicly available data that can be used
> > > for performance evaluation?
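
If nothing suitable turns up, documents in these categories are easy to
generate. Below is a purely illustrative sketch for the tag-centric case;
the element names, counts and output file name are made up, and the
content-centric case would simply invert the ratio (a handful of elements
holding large text chunks).

// Purely illustrative generator for a "tag centric" instance document:
// many small elements, each with a short run of text content.
import java.io.FileWriter;
import java.io.PrintWriter;

public class TagCentricGenerator {

    public static void main(String[] args) throws Exception {
        int tagCount = 500;   // "medium" bucket: 50-500 tags
        PrintWriter out = new PrintWriter(
                new FileWriter("tag-centric-medium.xml"));
        out.println("<?xml version=\"1.0\"?>");
        out.println("<root>");
        for (int i = 0; i < tagCount; i++) {
            // about 15-17 bytes of character content per element,
            // inside the 10-50 byte range given above
            out.println("  <item id=\"" + i + "\">short content " + i + "</item>");
        }
        out.println("</root>");
        out.close();
    }
}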
> > > 
> > > + DTDs used
> > >   - should use different types of entities
> > >   
> > > + XML Schemas used
> > >   - should use most of the elements and datatypes
> > >   
> > > Will it really help in any way?
> > > 
> > > Any comments or suggestions appreciated.
> > > 
> > > Thanks,
> > > Rahul.
> > > 
> > > 


