[ http://issues.apache.org/jira/browse/XERCESC-1363?page=comments#action_60431 ]

David Earlam commented on XERCESC-1363:
---------------------------------------
I checked out and built on VS.Net 2003 with the new 5-parameter version of substring called from tokenizeString. I can confirm that this lets me parse the file in about 18 seconds (or 15.7 seconds when using the option -wfile=NUL so as not to measure console I/O scrolling). This is about 36 kB per second. Even Christian's laptop reaches only 60 kB/s. Yet I have some C code that can process this at nearly 1 MB a second. I reckon Xerces list validation should be an order of magnitude faster than it currently is.

I tried avoiding all parameter-passing and call/return overhead by replacing tokenizeString's use of substring with

    {
        int copysize = skip - index;
        memcpy(token, &tokenizeStr[index], copysize * sizeof(XMLCh));
        token[copysize] = 0;
    }

since tokenizeString has the information to do this safely. Yet even compiled with /Oi this gave at best a 0.04% improvement over calling substring.

So I looked elsewhere. I figured that BaseRefVectorOf<XMLCh>::ensureExtraCapacity() is reallocating too often. I changed it to grow by half as much again each time, rather than by a constant 32 elements regardless of the vector's size. With this change the test file now parses in under 0.8 seconds. This is 672 kB/s! This is a speed versus data-space trade-off which I believe is justified.

Is there a unit-test suite I can run to ensure I've broken nothing?
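To illustrate why the growth policy matters so much, here is a minimal sketch that only counts reallocations under the two policies. It is not the Xerces code or the attached patch: countReallocations, growConstant and growByHalf are hypothetical names, and the 32-element floor in growByHalf is an assumption added so that small capacities still make progress.

```cpp
#include <cstddef>

// Hypothetical helper, not part of Xerces-C++: counts how many times a
// vector must reallocate to hold `target` elements appended one at a time.
// `grow` maps the current capacity to the next capacity.
template <typename Grow>
std::size_t countReallocations(std::size_t target, Grow grow) {
    std::size_t capacity = 0;
    std::size_t reallocations = 0;
    for (std::size_t size = 0; size < target; ++size) {
        if (size == capacity) {        // vector is full: must grow
            capacity = grow(capacity);
            ++reallocations;
        }
    }
    return reallocations;
}

// Policy described as the old behaviour: always add a constant 32 elements.
inline std::size_t growConstant(std::size_t capacity) {
    return capacity + 32;
}

// Policy described as the change: grow by half as much again.
// Assumption: never grow by fewer than 32 elements, so tiny vectors
// do not stall at capacity 0 or 1.
inline std::size_t growByHalf(std::size_t capacity) {
    std::size_t next = capacity + capacity / 2;
    return next < capacity + 32 ? capacity + 32 : next;
}
```

For a list of 256000 items, countReallocations(256000, growConstant) gives 8000 reallocations, each copying the whole vector so far, whereas growByHalf needs only a few dozen. The price is that up to roughly a third of the allocated capacity may sit unused, which is the space side of the trade-off.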
regards,
David

> DataTypeListValidator extraordinarily slow for long lists
> -----------------------------------------------------------
>
>          Key: XERCESC-1363
>          URL: http://issues.apache.org/jira/browse/XERCESC-1363
>      Project: Xerces-C++
>         Type: Bug
>   Components: Validating Parser (Schema) (Xerces 1.5 or up only)
>     Versions: 2.5.0, 2.6.0
>  Environment: Windows 2000
>     Reporter: David Earlam
>     Priority: Minor
>  Attachments: BaseRefVectorOf.c.patch, XMLString.cpp.patch, pq.zip, second_patch_XMLString.cpp.zip
>
> Validating an XML instance against a Schema with an unbounded xsd:list type
> can take much greater than O(n) processing resources, where n is the number
> of items in the list.
> To reproduce use this Schema:
> pq.xsd
> <?xml version="1.0" encoding="utf-8" ?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xmlns:pqns="http://swsis.cambridge.arm.com/~dearlam/xercestest/"
>     targetNamespace="http://swsis.cambridge.arm.com/~dearlam/xercestest/"
>     elementFormDefault="qualified" version="0.1">
>   <xs:annotation>
>     <xs:documentation xml:lang="en">
>       XML schema for Hofstadter's Gödel pq-System.
>
>       Test data for list data type validation.
>     </xs:documentation>
>   </xs:annotation>
>   <xs:element name="pqData" type="pqns:pqDataType"></xs:element>
>   <xs:complexType name="pqDataType">
>     <xs:complexContent>
>       <xs:restriction base="xs:anyType">
>         <xs:sequence minOccurs="1" maxOccurs="1">
>           <xs:element name="dashes" type="pqns:dashBlockType"></xs:element>
>           <xs:element name="p" type="xs:string" xsi:nill="true"></xs:element>
>           <xs:element name="dashes" type="pqns:dashBlockType"></xs:element>
>           <xs:element name="q" type="xs:string" xsi:nill="true"></xs:element>
>           <xs:element name="dashes" type="pqns:dashBlockType"></xs:element>
>         </xs:sequence>
>       </xs:restriction>
>     </xs:complexContent>
>   </xs:complexType>
>   <xs:complexType name="porqType">
>     <xs:simpleContent>
>       <xs:extension base="xs:string"></xs:extension>
>     </xs:simpleContent>
>   </xs:complexType>
>   <xs:complexType name="dashBlockType">
>     <xs:simpleContent>
>       <xs:extension base="pqns:dataDashes"></xs:extension>
>     </xs:simpleContent>
>   </xs:complexType>
>   <xs:simpleType name="Dash">
>     <xs:restriction base="xs:string">
>       <xs:pattern value="[\-]"></xs:pattern>
>     </xs:restriction>
>   </xs:simpleType>
>   <xs:simpleType name="dataDashes">
>     <xs:restriction base="pqns:DashList">
>       <xs:minLength value="0" />
>     </xs:restriction>
>   </xs:simpleType>
>   <xs:simpleType name="DashList">
>     <xs:list itemType="pqns:Dash"></xs:list>
>   </xs:simpleType>
> </xs:schema>
> and this XML file
> pqData0.xml
> <?xml version="1.0" encoding="utf-8" ?>
> <pqData xmlns='http://swsis.cambridge.arm.com/~dearlam/xercestest/'
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xsi:schemaLocation="http://swsis.cambridge.arm.com/~dearlam/xercestest/ http://swsis.cambridge.arm.com/~dearlam/xercestest/pq.xsd">
> <dashes>
> - -
> </dashes>
> <p/>
> <dashes>-</dashes>
> <q/>
> <dashes>-</dashes>
> </pqData>
> (replacing swsis.cambridge.arm.com/~dearlam/xercestest with your location)
> Then use
> domprint -wfpp=on pqData0.xml
> and
> domprint -n -s -wfpp=on pqData0.xml
> to print the XML non-validating and validating.
> They print in equally short time. OK.
> Now, edit pqData0.xml as pqData1.xml and replace
> - -
> with 4000 lines of
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> This gives a 500 kB file (which mimics my real data).
> If you then try
> domprint -wfpp=on pqData1.xml
> and
> domprint -n -s -wfpp=on pqData1.xml
> the first prints instantly (pipe it to NUL if you like), but the second
> consumes 99% CPU for 230 seconds, then prints.
> That's about 2 kB per second!
> --
> (My suspicion is XMLString::tokenizeString is using subString() to calculate
> the string length way too many times...)
> kind regards,
> David
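The suspicion about repeated length calculation can be illustrated with a single-pass tokenizer that never rescans the input: each token is copied using the index and skip offsets it already holds, so the total work stays O(n). This is a sketch under assumptions, not the Xerces implementation: Ch stands in for XMLCh, and tokenize and isSpace are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: a 16-bit code unit standing in for Xerces' XMLCh.
typedef char16_t Ch;

// XML whitespace: space, tab, line feed, carriage return.
inline bool isSpace(Ch c) {
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r';
}

// Splits `text` (of known length `len`) on whitespace in one pass.
// The string length is supplied once by the caller and never recomputed;
// each token is copied with an explicit start and count.
std::vector<std::u16string> tokenize(const Ch* text, std::size_t len) {
    std::vector<std::u16string> tokens;
    std::size_t index = 0;
    while (index < len) {
        // Skip leading whitespace.
        while (index < len && isSpace(text[index])) ++index;
        if (index == len) break;
        // Find the end of the token.
        std::size_t skip = index;
        while (skip < len && !isSpace(text[skip])) ++skip;
        // Copy exactly skip - index code units; no length scan needed.
        tokens.emplace_back(text + index, skip - index);
        index = skip;
    }
    return tokens;
}
```

By contrast, a helper that measures the whole source string on every call makes the tokenizer quadratic in the input length, which would match the "much greater than O(n)" behaviour described in the report.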