> Theodore W. Leung wrote...
> 
> 
> Okay, then can we agree on some stuff so that we can all work together? 
> What follows is a proposal to get some discussion going:
> 
> 1. Let's put the performance stuff in tests/performance
> 2. Let's put the data in directories under tests/performance/data

Looks fine.

> 3. Let each set of test data as described in your classification be in
> its own directory.   

Do you mean something like:
tests/performance/data/tagcentric/
tests/performance/data/contentcentric/

> How does the test driver find the data?  Is it via
> a file in the directory or will the suite walk the directory?  

If we agree on the above directory structure, we can pass tests/performance/data/ as a 
parameter to the test driver. The driver can then walk through tagcentric/*.xml and 
contentcentric/*.xml and generate a performance results report. What do others think?
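
Just to make that concrete, here is a rough sketch of the kind of driver I have in 
mind. The class/method names are only placeholders and the actual parsing code still 
has to be filled in, so please treat it as a starting point only:

import java.io.File;

// Rough sketch only: PerformanceDriver/parseAndTime are placeholder names,
// nothing here is decided yet.
public class PerformanceDriver {

    public static void main(String[] args) throws Exception {
        File dataRoot = new File(args[0]);   // e.g. tests/performance/data
        String[] categories = { "tagcentric", "contentcentric" };

        for (int i = 0; i < categories.length; i++) {
            File[] docs = new File(dataRoot, categories[i]).listFiles();
            if (docs == null) {
                continue;                    // category directory missing
            }
            for (int j = 0; j < docs.length; j++) {
                if (docs[j].getName().endsWith(".xml")) {
                    System.out.println(categories[i] + "\t" + docs[j].getName()
                            + "\t" + parseAndTime(docs[j]) + " ms");
                }
            }
        }
    }

    // Placeholder: this is where the configuration under test (SAX, DOM,
    // DTD/Schema validation, ...) would actually parse the document.
    static long parseAndTime(File doc) throws Exception {
        long start = System.currentTimeMillis();
        // org.apache.xerces.parsers.SAXParser parser =
        //         new org.apache.xerces.parsers.SAXParser();
        // parser.parse(doc.getPath());
        return System.currentTimeMillis() - start;
    }
}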

> How
> should we name the data sets?

I am not very good at naming, but to get the discussion going I can suggest something 
like TC_Small.xml and CC_Medium.xml, so that others can speak up. :-)

> 4. Please make the driver a JUnit test so we can run it from ant -- this
> means of course adding an ant target

Actually, I don't have much experience with JUnit. Do we really need to make the driver a 
JUnit test? We could have a simple driver that can be invoked from ant!
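
Having said that, if a JUnit wrapper is all that is needed, something like the thin 
wrapper below around the plain driver might already be enough. This is just a sketch, 
assuming the placeholder PerformanceDriver class from above, and I have not tried it:

import junit.framework.TestCase;

// Thin JUnit wrapper around the plain driver, so an ant <junit> target can run it.
// PerformanceDriver is the placeholder driver class from the earlier sketch.
public class PerformanceTest extends TestCase {

    public PerformanceTest(String name) {
        super(name);
    }

    public void testPerformance() throws Exception {
        // The data directory could be passed in as a system property by the ant target.
        String dataRoot = System.getProperty("perf.data.dir", "tests/performance/data");
        PerformanceDriver.main(new String[] { dataRoot });
    }
}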

Thanks,
Rahul.

> 
> > Thanks,
> > Rahul.
> > 
> > 
> > > 
> > > On Fri, 2002-05-03 at 14:03, Rahul Srivastava wrote:
> > > > 
> > > > Hi folks,
> > > > 
> > > > We have been talking for a long time about improving the performance of Xerces2. 
> > > > Some benchmarking has been done earlier, for instance the one done by Dennis 
> > > > Sosnoski, see: http://www.sosnoski.com/opensrc/xmlbench/index.html . Those results 
> > > > are important to know how fast/slow Xerces is compared to other parsers, but we 
> > > > also need to identify areas of improvement within Xerces. We need to measure the 
> > > > time taken by each individual component in the pipeline, figure out which component 
> > > > swallows how much time for various events, and then concentrate on improving 
> > > > performance in those areas. So, here is what we plan to do:
> > > > 
> > > > + sax parsing
> > > >   - time taken
> > > > + dom parsing
> > > >   - dom construction time
> > > >   - dom traversal time
> > > >   - memory consumed
> > > >   - considering the feature deferred-dom as true/false for all of the above
> > > > + DTD validation
> > > >   - one time parse, time taken
> > > >   - parsing multiple times using the same instance, time taken from the
> > > >     second parse onwards
> > > > + Schema validation
> > > >   - one time parse, time taken
> > > >   - parsing multiple times using the same instance, time taken from the
> > > >     second parse onwards
> > > > + optimising the pipeline
> > > >   - calculating pipeline/component initialization time.
> > > >   - calculating the time each component in the pipeline takes to propagate
> > > >     the event.
> > > >   - using configurations to set up an optimised pipeline for various cases
> > > >     such as no validation, DTD validation only, etc., and calculating the
> > > >     time taken.
> > > > 
> > > > Apart from this, should we also consider the existing grammar caching framework 
> > > > when evaluating the performance of the parser?
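
To make the "time taken" / "memory consumed" items above a bit more concrete, the DOM 
case could be measured roughly like the sketch below. Only the DOMParser API and the 
defer-node-expansion feature URI are real; the class itself is just a throwaway 
illustration of the kind of measurement I mean:

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;

// Sketch: DOM construction time and rough memory use for one document,
// toggling the deferred-dom feature. Invocation, e.g.:
//   java DomTimingSketch tests/performance/data/tagcentric/TC_Small.xml true
public class DomTimingSketch {

    public static void main(String[] args) throws Exception {
        boolean deferred = Boolean.valueOf(args[1]).booleanValue();

        DOMParser parser = new DOMParser();
        parser.setFeature(
            "http://apache.org/xml/features/dom/defer-node-expansion", deferred);

        Runtime rt = Runtime.getRuntime();
        rt.gc();
        long memBefore = rt.totalMemory() - rt.freeMemory();

        long start = System.currentTimeMillis();
        parser.parse(args[0]);                        // DOM construction
        long buildTime = System.currentTimeMillis() - start;

        Document doc = parser.getDocument();
        long memAfter = rt.totalMemory() - rt.freeMemory();

        System.out.println("deferred=" + deferred
                + "  construction=" + buildTime + " ms"
                + "  ~memory=" + (memAfter - memBefore) + " bytes"
                + "  root=" + doc.getDocumentElement().getNodeName());
    }
}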
> > > > 
> > > > We have classified the inputs to be used for this testing as follows:
> > > > 
> > > > + instance docs used
> > > >   - tag centric (more tags and small content, say 10-50 bytes)
> > > >       Type      Tags#
> > > >     -------------------
> > > >     * small     5-50   
> > > >     * medium    50-500
> > > >     * large     >500  
> > > >     
> > > >   - content centric (fewer tags, say 5-10, and huge content)
> > > >       Type      Content between a pair of tags
> > > >     -------------------------------------
> > > >     * small     <500 kb
> > > >     * medium    500-5000 kb
> > > >     * large     >5000 kb
> > > > 
> > > > We can also use the nesting depth of the tags as a criterion for the above cases.
> > > > 
> > > > Actually speaking, there can be an enormous number of combinations and different 
> > > > figures in the above table that reflect the real world instance docs used. I would 
> > > > like to know the view of the community here. Is this data enough to evaluate the 
> > > > performance of the parser? Is there any data which is publicly available and can 
> > > > be used for performance evaluation?
> > > > 
> > > > + DTDs used
> > > >   - should use different types of entities
> > > >   
> > > > + XML Schemas used
> > > >   - should use most of the elements and datatypes
> > > >   
> > > > Will it really help in any way?
> > > > 
> > > > Any comments or suggestions appreciated.
> > > > 
> > > > Thanks,
> > > > Rahul.
> > > > 
> > > > 


Sun Microsystems, Inc.

