> Hi Eric,
>
> >uncommonness of a feature's use and lack of usable
> >support for it is kind of a chicken-and-egg / self-fulfilling prophecy
> >kind of thing.
>
> I guess usefulness is in the eye of the user. :) I would argue that, when
> your XML docs start getting into the megabytes, you've basically got a
> small-time database on your hands. You should probably be using database
> techniques/software to determine things like uniqueness of values; a basic
> parser is inevitably not going to be as good at that.
Well, my own code using either SAX parsing or DOM treewalking to do similar uniqueness checking is reasonably good at that, so I don't see why not. A lot of XML features will never be as good as the equivalent in some other technology, but that doesn't make them valueless as part of XML.

As to this document being database-like: it is, in a certain way. It's a taxonomy, and as such highly hierarchical and rather well suited to an XML database rather than a relational one for many purposes, though a relational representation has advantages for other purposes. We do actually use a relational model in the end application for storing this and doing certain more complex types of querying, but XML is very useful for making this data persistable and easily portable--for management and editing via a standalone GUI tool, etc. So what I'm doing is not unreasonable. Enforcing uniqueness constraints at this stage of our process (rather than later, upon load into the relational representation) is very handy, and storing them in the data model as defined by the schema is very useful and more maintainable than keeping them in the applications.

I suppose the philosophical issue of whether schema validation is the proper place for enforcing those kinds of constraints, rather than a layered approach using something like Schematron, is an open one--if one concludes that it is not, then XML Schema's very support of these features is the problem. Admittedly, the large size of the document makes mine a less-than-typical usage. However, I'd argue that this merely serves to bring a problem to light by magnification, and is not itself the cause of the problem.

Sorry to keep running on in support of this feature, talking up my own use case and those that my own experience has told me are important regardless of common usage. But in my opinion/experience, the usage of XML Schema, and especially its more advanced features, has been slow to take off because tool support for them has been very poor.
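A minimal sketch of the kind of streaming uniqueness check described above (not the author's actual code; the "term" element and "id" attribute names are illustrative) might look like this as a SAX handler:

```java
// Sketch of a SAX-based uniqueness check: collect each key value into a
// hash set as it streams past, and record any value seen twice.
// Element name "term" and attribute name "id" are illustrative assumptions.
import java.util.HashSet;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class UniquenessHandler extends DefaultHandler {
    public final Set<String> seen = new HashSet<>();
    public final Set<String> duplicates = new HashSet<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("term".equals(qName)) {            // illustrative element name
            String key = atts.getValue("id");  // illustrative key attribute
            if (key != null && !seen.add(key)) {
                duplicates.add(key);           // second sighting of this key
            }
        }
    }
}
```

Fed to `javax.xml.parsers.SAXParser.parse(...)`, each key costs one expected-constant-time hash lookup, so a whole-document pass stays linear in the number of keys.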
I'm not indicting Xerces here--I wish the other tools I had to deal with had the same quality of schema support that Xerces does (really, this constraint performance is the only issue I have with it). We've had to dumb down our usage of all sorts of schema features because of the lame schema support XMetal and other tools give. So in my opinion, widespread usage of a lot of XML Schema features will always lag support in Xerces and other parsers. But tool support has to come before common usage, and I think parser support has to come before tool support.

I had further comments about the potential for turning your O(N^2) into O(N) with hashing, but your exchange with Joseph Kesselman has preempted that. : ^)

Thanks,
Eric

> >The almost
> >equal slowness of the SAX parsing of this makes me wonder if both parsing
> >methods are using the same xpath code, perhaps DOM-based xpath code, like
> >that provided by Xalan
>
> Xerces schema validation code is written entirely independently of what
> kind of parser happens to be using it. The xpath implementation is
> stream-based; the schema spec does limit it enough that we don't need to
> maintain any kind of tree structures. The code that does the xpath
> processing is in the Field, Selector, and XPathMatcher classes and their
> inner classes in the org/apache/xerces/impl/xs/identity package.
>
> >Your comment about "Xerces will take O(N^2) operations to prove it to
> >itself" puzzles me though. Surely it doesn't iterate through the entire
> >document again every time it finds a key node, to compare for dupes?
>
> No; it iterates through all the other key values it's seen so far (from the
> same constraint).
>
> >Surely
> >it simply stores the key value into a HashSet or something as it goes and
> >checks for previous key existence as it goes, giving O(N) operations?
>
> It stores the previous keys in a vector, marches through and looks at each
> one, adding the new key if it finds no dups.
> That's why I say there's
> certainly room for improvement... The ValueStore inner classes of
> XMLSchemaValidator are where the code lives for this.
>
> Catching all the edge cases in any reimplementation would be a challenge
> though. I'm sure it could be done, but not hitting mainline code's
> performance and maintaining spec-conformance (we're extremely conformant
> with ID constraints right now) might be tricky.
>
> Let me know if you plan to tackle this and I'll help where I can.
> Otherwise, I'll look into this some day, hopefully. :)
>
> Cheers,
> Neil
>
> Neil Graham
> XML Parser Development
> IBM Toronto Lab
> Phone: 905-413-3519, T/L 969-3519
> E-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
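The hash-based replacement for the vector scan discussed in the exchange above could be sketched roughly as follows. This is not Xerces code (the class and method names are illustrative); it only shows the core data-structure change, and as Neil notes, a real reimplementation would also have to handle the spec's edge cases:

```java
// Sketch of a hashed value store for identity-constraint keys (illustrative,
// not the actual Xerces ValueStore). A composite key from a multi-field
// constraint is joined into one List so it can be hashed and compared.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashedValueStore {
    private final Set<List<String>> seen = new HashSet<>();

    // Returns true if the key tuple is new, false if it is a duplicate.
    // Each call is O(1) expected time, so N keys cost O(N) total, versus
    // O(N^2) for scanning a vector of all prior keys on every addition.
    public boolean addKey(String... fields) {
        return seen.add(Arrays.asList(fields));
    }
}
```

Because `Arrays.asList` wraps the fields in a `List` whose `hashCode` and `equals` are value-based, composite keys compare correctly without any custom tuple class.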
