The patch looks good to me (except that something funny seemed to have
happened to the last 30 lines of the patch file, which I've removed). I've
committed it.
Many thanks to Michael for the patch.
Sandy Gao
Software Developer, IBM Canada
(1-905) 413-3255
[EMAIL PROTECTED]
From: Michael R MR Glavassevich <[EMAIL PROTECTED]aterloo.ca>
To: [EMAIL PROTECTED]
cc:
Subject: Proposed changes to improve performance of org.apache.xerces.util.URI
Date: 01/11/2003 03:36 PM
Please respond to xerces-j-dev
I have several proposed changes to org.apache.xerces.util.URI, which I
have included in a patch attached to this e-mail.
I noticed that most of the time spent expanding system ids is spent in the
constructor of org.apache.xerces.util.URI (see Revision: 1.6, Fri May 10
16:30:10 2002 UTC in CVS).
Currently, checks for allowable URI characters involve scanning strings
containing these characters. For example: RESERVED_CHARACTERS =
";/?:@&=+$,[]", and MARK_CHARACTERS = "-_.!~*'()". When initializing the
URI's path, a linear scan is done on these strings to check whether a
particular character belongs to one of these character classes. Currently,
verifying that ')' is a valid URI character requires scanning 21
characters across RESERVED_CHARACTERS and MARK_CHARACTERS, plus a check in
the middle of whether or not that character is alphanumeric. For an
alphanumeric character, the 12 characters of RESERVED_CHARACTERS are
scanned before the check for whether the character is alphanumeric.
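
To make the cost concrete, here is a minimal sketch of a scan-based check
of the kind described above. It is not the Xerces source: the class name
LinearScanCheck and the "scanned" counter are hypothetical, added only to
count how many string positions are examined for one character.

```java
// Illustrative reconstruction of a scan-based check; not the Xerces code.
final class LinearScanCheck {
    static final String RESERVED_CHARACTERS = ";/?:@&=+$,[]"; // 12 characters
    static final String MARK_CHARACTERS = "-_.!~*'()";        // 9 characters

    /** Hypothetical counter: string positions examined by the last call. */
    static int scanned;

    static boolean isURICharacter(char c) {
        scanned = 0;
        // Step 1: linear scan of the reserved set.
        for (int i = 0; i < RESERVED_CHARACTERS.length(); i++) {
            scanned++;
            if (RESERVED_CHARACTERS.charAt(i) == c) {
                return true;
            }
        }
        // Step 2: the alphanumeric check "in the middle" (ASCII only here).
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')) {
            return true;
        }
        // Step 3: linear scan of the mark set.
        for (int i = 0; i < MARK_CHARACTERS.length(); i++) {
            scanned++;
            if (MARK_CHARACTERS.charAt(i) == c) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, checking ')' examines all 12 reserved characters and
all 9 mark characters (21 in total), and checking a letter still
examines the 12 reserved characters first.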
Also, many temporary strings are created which could be avoided by passing
indexes instead of creating substrings, and by using StringBuffers. This
is most relevant to lines 488-542, where the URI is resolved. For URIs
containing many /./ and /../ segments, as many as three temporary strings
are created for each of these segments. For a small document with a
relative URI like ../../../../../../../../../../../../../../../../doc.xml,
most of the parsing time may be spent resolving the URI.
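
As a rough illustration of the "indexes plus one buffer" idea, the sketch
below collapses /./ and /../ segments by walking the path with indexes
and appending into a single StringBuilder (the unsynchronized successor
of StringBuffer). It is not the Xerces algorithm. The class and method
names are hypothetical, and keeping ".." segments that have nothing left
to remove is an assumption of this sketch.

```java
// Hypothetical single-buffer normalizer; not the Xerces resolution code.
final class DotSegments {
    static String normalize(String path) {
        int len = path.length();
        StringBuilder out = new StringBuilder(len);
        int i = 0;
        while (i < len) {
            // Current segment is path[i, end); no substring is created.
            int end = path.indexOf('/', i);
            if (end < 0) {
                end = len;
            }
            int segLen = end - i;
            boolean hasSlash = end < len;
            if (segLen == 1 && path.charAt(i) == '.') {
                // "." segment: drop it.
            } else if (segLen == 2 && path.charAt(i) == '.'
                    && path.charAt(i + 1) == '.' && canPop(out)) {
                // ".." segment: truncate the buffer to drop the previous segment.
                out.setLength(lastSegmentStart(out));
            } else {
                // Ordinary segment (or an unresolvable ".."): copy the range.
                out.append(path, i, end);
                if (hasSlash) {
                    out.append('/');
                }
            }
            i = end + 1;
        }
        return out.toString();
    }

    /** Start index of the last segment; the buffer is empty or ends with '/'. */
    private static int lastSegmentStart(StringBuilder out) {
        return out.lastIndexOf("/", out.length() - 2) + 1;
    }

    /** True if the buffer ends with a real segment that ".." may remove. */
    private static boolean canPop(StringBuilder out) {
        if (out.length() == 0) {
            return false;
        }
        int start = lastSegmentStart(out);
        int prevLen = out.length() - 1 - start;
        if (prevLen == 0) {
            return false; // at the root or after an empty segment
        }
        boolean prevIsDotDot = prevLen == 2 && out.charAt(start) == '.'
                && out.charAt(start + 1) == '.';
        return !prevIsDotDot;
    }
}
```

Each /../ segment here costs one setLength call on the shared buffer
rather than several temporary strings; only the final toString()
allocates a new string.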
I've made a number of changes locally to the URI class, and have observed
a considerable performance boost in the construction of URI
objects. (There are still other areas for improvement which aren't
addressed by these changes, such as resolving of URIs.)
Here's what I added/modified:
1) I created a lookup table for the character classes of URIs, to
eliminate the linear character searches through strings.
2) I modified the character checking methods to use the lookup table, and
combined checks, e.g. (isReservedCharacter || isUnreservedCharacter) ->
isURICharacter.
3) I modified the signatures of initializeAuthority and initializePath to
take string indexes, eliminating the need to create substrings when
calling these methods.
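
The three changes above could look roughly like the sketch below. This
is not the patch itself (which is not reproduced in this note): the class
name UriCharTable, the flag constants, and the isValidRange helper are
hypothetical, and the table covers ASCII only. The method names
isReservedCharacter, isUnreservedCharacter and isURICharacter follow the
names mentioned in the list.

```java
// Hypothetical lookup-table sketch; names and layout are assumptions.
final class UriCharTable {
    private static final int RESERVED = 0x01;
    private static final int MARK = 0x02;
    private static final int ALPHANUM = 0x04;
    private static final int MASK_UNRESERVED = MARK | ALPHANUM;
    private static final int MASK_URI = RESERVED | MASK_UNRESERVED;

    /** One flag byte per ASCII character, filled once at class load. */
    private static final byte[] FLAGS = new byte[128];

    static {
        mark(";/?:@&=+$,[]", RESERVED);
        mark("-_.!~*'()", MARK);
        for (char c = '0'; c <= '9'; c++) FLAGS[c] |= ALPHANUM;
        for (char c = 'a'; c <= 'z'; c++) FLAGS[c] |= ALPHANUM;
        for (char c = 'A'; c <= 'Z'; c++) FLAGS[c] |= ALPHANUM;
    }

    private static void mark(String chars, int flag) {
        for (int i = 0; i < chars.length(); i++) {
            FLAGS[chars.charAt(i)] |= flag;
        }
    }

    static boolean isReservedCharacter(char c) {
        return c < 128 && (FLAGS[c] & RESERVED) != 0;
    }

    static boolean isUnreservedCharacter(char c) {
        return c < 128 && (FLAGS[c] & MASK_UNRESERVED) != 0;
    }

    /** Combined check: one table read instead of two separate scans. */
    static boolean isURICharacter(char c) {
        return c < 128 && (FLAGS[c] & MASK_URI) != 0;
    }

    /**
     * Index-based check over s[start, end), so the caller does not need to
     * create a substring first. Escape sequences are not handled here.
     */
    static boolean isValidRange(String s, int start, int end) {
        for (int i = start; i < end; i++) {
            if (!isURICharacter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Every check is then a bounds test plus one array read, regardless of
where the character would have sat in the old strings, and passing
(start, end) lets a caller validate part of a spec in place.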
These changes reduce the time spent expanding/fixing URIs in
XMLEntityManager, and they should also benefit schema validation, since I
noticed that the URI class is used in the anyURI type validator.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
#### uri-patch.txt has been removed from this note on January 13 2003 by
Sandy Gao