[ http://issues.apache.org/jira/browse/XERCESC-1363?page=comments#action_60408 ]

Christian Will commented on XERCESC-1363:
-----------------------------------------
Hi,

I found a better solution to fix that performance problem.

I reviewed my changes and figured out that if we don't have the source
string length, we cannot verify that the start index is inside our source
string. And if we want to do that check in release builds, we have to find
a better way.

After that I looked into the function XMLString::tokenizeString(...), from
where we call XMLString::substring(...), and saw that we are already
calculating the length of our source string there.

So here is what I did:
I created a new function XMLString::substring(...) with an additional
parameter srcStrLength. Whenever we have the length information available,
we forward it and avoid all additional string length calculations.
I changed the function call in XMLString::tokenizeString(...) to the new
function.

David, I tested your large XML file without my changes and could parse it
(MS C++ 6) on my laptop in ~160 seconds. After I applied the new patch, it
took just 9 seconds. :-)

I think this way is better. The functionality of the original function is
not changed; we just offer a better substring function for the case where
the string length information is already available.

I'll attach the new patch.

Regards,
Christian

> DataTypeListValidator extraordinarily slow for long lists
> -----------------------------------------------------------
>
>          Key: XERCESC-1363
>          URL: http://issues.apache.org/jira/browse/XERCESC-1363
>      Project: Xerces-C++
>         Type: Bug
>   Components: Validating Parser (Schema) (Xerces 1.5 or up only)
>     Versions: 2.5.0, 2.6.0
>  Environment: Windows 2000
>     Reporter: David Earlam
>     Priority: Minor
>  Attachments: XMLString.cpp.patch, pq.zip, second_patch_XMLString.cpp.zip
>
> Validating an XML instance against a Schema with an unbounded xsd:list type
> can take much more than O(n) processing resources, where n is the number
> of items in the list.
> To reproduce use this Schema:
> pq.xsd
> <?xml version="1.0" encoding="utf-8" ?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xmlns:pqns="http://swsis.cambridge.arm.com/~dearlam/xercestest/"
>     targetNamespace="http://swsis.cambridge.arm.com/~dearlam/xercestest/"
>     elementFormDefault="qualified" version="0.1">
>   <xs:annotation>
>     <xs:documentation xml:lang="en">
>       XML schema for Hofstadter's Gödel pq-System.
>
>       Test data for list data type validation.
>     </xs:documentation>
>   </xs:annotation>
>   <xs:element name="pqData" type="pqns:pqDataType"></xs:element>
>   <xs:complexType name="pqDataType">
>     <xs:complexContent>
>       <xs:restriction base="xs:anyType">
>         <xs:sequence minOccurs="1" maxOccurs="1">
>           <xs:element name="dashes"
>               type="pqns:dashBlockType"></xs:element>
>           <xs:element name="p" type="xs:string"
>               xsi:nill="true"></xs:element>
>           <xs:element name="dashes"
>               type="pqns:dashBlockType"></xs:element>
>           <xs:element name="q" type="xs:string"
>               xsi:nill="true"></xs:element>
>           <xs:element name="dashes"
>               type="pqns:dashBlockType"></xs:element>
>         </xs:sequence>
>       </xs:restriction>
>     </xs:complexContent>
>   </xs:complexType>
>   <xs:complexType name="porqType">
>     <xs:simpleContent>
>       <xs:extension base="xs:string"></xs:extension>
>     </xs:simpleContent>
>   </xs:complexType>
>   <xs:complexType name="dashBlockType">
>     <xs:simpleContent>
>       <xs:extension base="pqns:dataDashes"></xs:extension>
>     </xs:simpleContent>
>   </xs:complexType>
>   <xs:simpleType name="Dash">
>     <xs:restriction base="xs:string">
>       <xs:pattern value="[\-]"></xs:pattern>
>     </xs:restriction>
>   </xs:simpleType>
>   <xs:simpleType name="dataDashes">
>     <xs:restriction base="pqns:DashList">
>       <xs:minLength value="0" />
>     </xs:restriction>
>   </xs:simpleType>
>   <xs:simpleType name="DashList">
>     <xs:list itemType="pqns:Dash"></xs:list>
>   </xs:simpleType>
> </xs:schema>
> and this XML file
> pqData0.xml
> <?xml version="1.0" encoding="utf-8" ?>
> <pqData
>     xmlns='http://swsis.cambridge.arm.com/~dearlam/xercestest/'
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xsi:schemaLocation="http://swsis.cambridge.arm.com/~dearlam/xercestest/
>     http://swsis.cambridge.arm.com/~dearlam/xercestest/pq.xsd">
>   <dashes>
>   - -
>   </dashes>
>   <p/>
>   <dashes>-</dashes>
>   <q/>
>   <dashes>-</dashes>
> </pqData>
> (replacing swsis.cambridge.arm.com/~dearlam/xercestest with your location)
> Then use
>   domprint -wfpp=on pqData0.xml
> and
>   domprint -n -s -wfpp=on pqData0.xml
> to print the XML non-validating and validating.
> Both print in about the same short time. OK.
> Now, save pqData0.xml as pqData1.xml and replace
> - -
> with 4000 lines of
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> This gives a 500 KB file (which mimics my real data).
> If you then try
>   domprint -wfpp=on pqData1.xml
> and
>   domprint -n -s -wfpp=on pqData1.xml
> the first prints instantly (pipe it to NUL if you like), but the second
> consumes 99% CPU for 230 seconds, then prints.
> That's only about 2 KB per second!
> --
> (My suspicion is XMLString::tokenizeString is using subString() to calculate
> the string length way too many times...)
> kind regards,
> David
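
For readers following the thread, the approach described in the comment
(measure the source string once in the tokenizer and pass that length down
to the substring routine) can be sketched roughly as below. This is only a
minimal illustration in plain C++ using char rather than XMLCh. The names
substringWithLength, substring, tokenize and isXmlSpace are hypothetical;
they are not the signatures used in the attached patch or in XMLString.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not the Xerces-C++ signature): copies src[start, end)
// into target. The caller supplies srcLen, so no strlen() is needed here.
void substringWithLength(char* target, const char* src,
                         std::size_t start, std::size_t end,
                         std::size_t srcLen)
{
    // With the length already known, the bounds check costs O(1).
    if (start > end || end > srcLen)
        throw std::out_of_range("substring indices outside source string");
    std::memcpy(target, src + start, end - start);
    target[end - start] = '\0';
}

// Hypothetical convenience wrapper for callers that do not know the length.
// It pays one O(n) strlen() per call; calling it once per token is what
// makes a tokenizer loop quadratic in the length of the list.
void substring(char* target, const char* src,
               std::size_t start, std::size_t end)
{
    substringWithLength(target, src, start, end, std::strlen(src));
}

// Assumption for this sketch: tokens are separated by XML whitespace.
static bool isXmlSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Hypothetical tokenizer: measures the source once and forwards the length.
std::vector<std::string> tokenize(const char* src)
{
    const std::size_t len = std::strlen(src);  // computed exactly once
    std::vector<std::string> tokens;
    std::vector<char> buf(len + 1);            // large enough for any token

    std::size_t i = 0;
    while (i < len) {
        while (i < len && isXmlSpace(src[i])) ++i;   // skip separators
        if (i == len) break;
        const std::size_t start = i;
        while (i < len && !isXmlSpace(src[i])) ++i;  // scan one token
        assert(i <= len);
        substringWithLength(buf.data(), src, start, i, len);
        tokens.emplace_back(buf.data());
    }
    return tokens;
}
```

With this shape, tokenizing a list of n characters performs one length
calculation in total instead of one per token, while callers that lack the
length can still use the plain substring wrapper unchanged.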