Re: Schema key and unique contraints VERY slow

neilg Fri, 15 Nov 2002 08:58:52 -0800

Hi Eric,

>uncommonness of a feature's use and lack of useable
>support for it is kind of a chicken-and-egg / self-fulfilling prophesy
>kind of thing.


I guess usefulness is in the eye of the user.  :)  I would argue that, when
your XML docs start getting into the megabytes, you've basically got a
small-time database on your hands.  You should probably be using database
techniques/software to determine things like uniqueness of values; a basic
parser is inevitably not going to be as good at that.

>The almost
>equal slowness of the SAX parsing of this makes me wonder if both parsing
>methods are using the same xpath code, perhaps DOM-based xpath code, like
>that provided by Xalan

Xerces schema validation code is written entirely independently of what
kind of parser happens to be using it.  The xpath implementation is
stream-based; the schema spec does limit it enough that we don't need to
maintain any kind of tree structures.  The code that does the xpath
processing are the Field, Selector, XPathMatcher and their inner classes in
the org/apache/xerces/impl/xs/identity package.

>Your comment about "Xerces will take O(N^2) operations to prove it to
>itself" puzzles me though. Surely it doesn't iterate through the entire
>document again every time it finds a key node, to compare for dupes?
No; it iterates through all the other key values it's seen so far (from the
same constraint).

>Surely
>it simply stores the keyvalue into a HashSet or something as it goes and
>checks for previous key existance as it goes, giving O(N) operations?

It stores the previous keys in a vector, marches through and looks at each
one, adding the new key if it finds no dups.  That's why I say there's
certainly room for improvement...  The ValueStore inner classes of
XMLSchemaValidator are where the code lives for this.

Catching all the edge cases in any reimplementation would be a challenge
though.  I'm sure it could be done, but not hitting mainline code's
performance and maintaining spec-conformance (we're extremely conformant
with ID constraints right now) might be tricky.

Let me know if you plan to tackle this and I'll help where I can.
Otherwise, I'll look into this some day, hopefully.  :)

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




|---------+-------------------------------->
|         |           Eric_Schwarzenbach@Cl|
|         |           asswell.com          |
|         |                                |
|         |           11/15/2002 11:35 AM  |
|         |           Please respond to    |
|         |           xerces-j-dev         |
|         |                                |
|---------+-------------------------------->
  
>---------------------------------------------------------------------------------------------------------------------------------------------|
  |                                                                                    
                                                         |
  |       To:       Neil Graham/Toronto/IBM@IBMCA, [EMAIL PROTECTED]         
                                                         |
  |       cc:                                                                          
                                                         |
  |       Subject:  Re: Schema key and unique contraints VERY slow                     
                                                         |
  |                                                                                    
                                                         |
  |                                                                                    
                                                         |
  
>---------------------------------------------------------------------------------------------------------------------------------------------|





I miswrote the version--I'm really using 2.2.1 and so seeing this slowness
with the latest version. I'm moving this thread to xerces-j-dev...

I can understand the relatively low priority of constraint
handling...though uncommonness of a feature's use and lack of useable
support for it is kind of a chicken-and-egg / self-fulfilling prophesy kind
of thing.

I'm not ruling out digging into the code myself to possibly take a look at
the problem and possibly fixing it, though I wouldn't want to make any
promises that I'll find time for it. In case I do, I'd welcome any pointers
anyone wants to give me to finding the code that handles this.

But before digging into code, perhaps we could discuss at a high level,
before d how the xpath bit in these constraints are handled. The almost
equal slowness of the SAX parsing of this makes me wonder if both parsing
methods are using the same xpath code, perhaps DOM-based xpath code, like
that provided by Xalan (a suspicion fed by the lack of SAX based xpath
processors available). If that's the case, then there is likely huge room
for optimization.

Its all the functions and different axes and whatnot that makes xpath
processors hard to write (espec in SAX), but the Schema spec indicates that
these xpaths are allowed only to be a subset of xpath. I'm not very good at
deciphering the mathy jargon of the spec, but it looks to me like the
subset they define would be much easier to write a code for, including SAX
code that doesn't have to resort to building a DOM tree. Even if that's not
the case, the most likely usages of xpath for these constraints are going
to be the most simple ones, which would be relatively easy to detect and
handle with simpler more efficient code that doesn't cover all the
possibilies.

Anyway, I'd appreciate being confirmed or disabused of these speculations
before I decide whether  to start digging into Xerces code. :^)

Your comment about "Xerces will take O(N^2) operations to prove it to
itself" puzzles me though. Surely it doesn't iterate through the entire
document again every time it finds a key node, to compare for dupes? Surely
it simply stores the keyvalue into a HashSet or something as it goes and
checks for previous key existance as it goes, giving O(N) operations?

Eric









                      [EMAIL PROTECTED]

                                               To:
[EMAIL PROTECTED]
                      11/15/2002 10:40         cc:

                      AM                       Subject:  Re: Schema key and
unique contraints VERY slow
                      Please respond to

                      xerces-j-user







Hi Eric,

You'll definitely want to move to 2.2.1; a fair bit of work has been done
on improving performance (and conformance!) of identity constraint support
since 2.0.2.

That said, I can't say I'm shocked that you're not seeing good performance.
If you have N distinct keys in your document, Xerces will take O(N^2)
operations to prove it to itself; I daresay something could be done on this
front.  But our goal is to optimize performance for the case in which ID
constraints aren't used, since they seem to be relatively uncommon; so any
change to make them perform better would have to live within these
constraints.

You're the first I know of to report this kind of problem.  So I guess you
could file this as a bug, but I can't promise how fast it'd be
investigated.  As I always say, patches welcome!  :)

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




|---------+-------------------------------->
|         |           Eric_Schwarzenbach@Cl|
|         |           asswell.com          |
|         |                                |
|         |           11/14/2002 09:04 PM  |
|         |           Please respond to    |
|         |           xerces-j-user        |
|         |                                |
|---------+-------------------------------->
  >
---------------------------------------------------------------------------------------------------------------------------------------------|


  |
|
  |       To:       [EMAIL PROTECTED]
|
  |       cc:
|
  |       Subject:  Schema key and unique contraints VERY slow
|
  |
|
  |
|
  >
---------------------------------------------------------------------------------------------------------------------------------------------|






Xerces2 (I'm using the freshly downloaded 2.0.2) seems to be hideously slow
valdiating xs:unique and xs:key constraints. Painfully slow on a human
scale not simply a processor scale...a large document of mine that
specifies a unique id attribute on each element of a certain kind (where we
are talking 10,000 of these elements in a document of about 4 Meg ) goes
from taking a few seconds to validate without these constraints to several
minutes with them. This happens with both SAX and DOM parsing.

I expect there to be some processing cost to using this feature but this is
fairly ridiculous...doing similar checking in my own java code which is
using the parser takes nowhere nearly as long (in fact building indexes on
the entire document along with it takes nowhere nearly as long). Something
would seem to be seriously amiss, unless I'm out to lunch with regard to my
usage...

An example of my usage is something like:

            <xs:unique name="makeElemUnique">
                  <xs:selector xpath=".//myns:elem"/>
                  <xs:field xpath="@id"/>
            </xs:unique>

This is defined within the scope of the root element which can (will)
contain many of these elems at many different levels.

Should I file this as a bug?

Eric





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Schema key and unique contraints VERY slow

Reply via email to