Hello all,
I'm facing the following problem, and I'd very much appreciate comments from
people on this list. I apologize for the long post.
We have constructed an XML Schema and a Schematron schema that together
constrain the data we accept into an information system. Technically
speaking, it works great. We have an extensive suite of XML inputs and
corresponding expected error reports and we use these to test that the
checker does what it's supposed to do.
Where this fails is in the error reports from the XML Schema validation. In
the Schematron reports we can speak in terms of the problem domain (data
about scientific projects, their participants and the financial support
thereof) - we write the messages ourselves. So we can e.g. report that a
project should specify the date when it started. That's understandable to all
our users and we can provide all useful diagnostics as to where the problem
is located. However, if we place this constraint in the XML Schema, all we
get is a cvc-something error report saying that the content of element
'lifecycle' doesn't match its content model. This is accompanied by line and
column numbers. In this form, our users find it pretty much indigestible.
The first idea I had was to run away from XML Schema, to place all
constraints in the Schematron schema. There might even be a way to
automatically generate the Schematron constraints from an XML Schema,
where we might be able to adjust the violation report texts. If we are sure
all constraints from the XML Schema are moved to Schematron, we could skip
the XML Schema validation step.
However, moving all the constraints to Schematron would increase the
number of assertions from some 800 to some 6000 (est.) and that's a level of
complexity neither we, nor our customer can afford. We might also face
performance problems.
The feasible way out seems to be to gradually add checks to the Schematron
schema and report violations there. We'll start with the most frequent ones,
and continue with those where the error reports are especially cryptic.
Throughout the process, we need to be sure at every moment that no error
remains unreported. We might report an error twice, but then a simple
correction - suppressing the report from the XML Schema validator - should
take care of that. In the end, we might find that something like 30% of the
constraints have moved to Schematron.
Now, we need to selectively suppress those XML Schema violations that will be
reported by Schematron. We can't move whole XML Schema constraint types one
at a time. The unit we move will always be a constraint type in a specific
context (of an element type, or of an XML Schema type).
For that, we need a way of locating errors that is common to both validators.
I'm afraid that getting the physical locations from Schematron is too
difficult a task, and the result might not quite match the physical locations
reported by Xerces. On the
other hand, Schematron can reliably produce 'logical' locations, something
like 'canonical XPath' to the node where the violation occurred. E.g.
'/root/a[1]/b[23]' meaning the 23rd 'b' child of the first 'a' child of
'root'. (Things are more difficult in the presence of namespaces, but still
tractable.)
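To make the idea of a 'canonical XPath' concrete, here is a minimal sketch of how such a location could be computed for a DOM element. It is only an illustration under my own assumptions: the class name CanonicalPath and its methods are hypothetical, it uses only the standard org.w3c.dom API, and it sidesteps namespaces by using the qualified names exactly as they are written in the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Xerces or of any Schematron implementation.
final class CanonicalPath {

    /** Builds a path such as /root/a[1]/b[23] for an element node. */
    static String of(Node element) {
        StringBuilder path = new StringBuilder();
        for (Node n = element;
             n != null && n.getNodeType() == Node.ELEMENT_NODE;
             n = n.getParentNode()) {
            String step = "/" + n.getNodeName();
            Node parent = n.getParentNode();
            // The document element is unique, so it gets no positional predicate.
            if (parent != null && parent.getNodeType() == Node.ELEMENT_NODE) {
                int position = 1;
                // Count preceding element siblings that carry the same name.
                for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                    if (s.getNodeType() == Node.ELEMENT_NODE
                            && s.getNodeName().equals(n.getNodeName())) {
                        position++;
                    }
                }
                step += "[" + position + "]";
            }
            path.insert(0, step);
        }
        return path.toString();
    }

    /** Parses an XML string into a DOM document (for the demonstration only). */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root><a><b/><c/><b/></a></root>");
        Node secondB = doc.getElementsByTagName("b").item(1);
        System.out.println(of(secondB)); // prints /root/a[1]/b[2]
    }
}
```

The position counts only siblings with the same name, which matches the meaning of '/root/a[1]/b[23]' above. A namespace-aware variant would have to compare namespace URI and local name instead of the qualified name.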
How difficult would it be to extend Xerces to:
(i) Produce 'logical' locations in terms of 'canonical' XPaths
as described above.
(ii) Pass these locations to XMLErrorReporter.
Then I could set up a filtering XMLErrorReporter that would let me gradually
move violation reports from XML Schema to Schematron.
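The filtering decision itself could be as simple as the following sketch. It deliberately does not touch any Xerces API, since the logical location is exactly what Xerces does not hand me today. The class SchemaErrorFilter, its methods, and the element names in the usage note are all hypothetical. I assume only that each violation arrives with an error key (for example 'cvc-complex-type.2.4.b') and a canonical XPath of the kind described above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter; how it gets wired into the validator is left open.
final class SchemaErrorFilter {

    // Each entry is "<error key> <context path without positions>".
    private final Set<String> suppressed = new HashSet<>();

    /** Marks an XML Schema error key as covered by Schematron in one context. */
    void suppress(String errorKey, String contextPath) {
        suppressed.add(errorKey + " " + contextPath);
    }

    /** Returns true if the XML Schema violation should still be reported. */
    boolean shouldReport(String errorKey, String canonicalPath) {
        // Drop positional predicates: /root/a[1]/b[23] becomes /root/a/b.
        String context = canonicalPath.replaceAll("\\[\\d+\\]", "");
        return !suppressed.contains(errorKey + " " + context);
    }
}
```

With a rule registered through suppress("cvc-complex-type.2.4.b", "/projects/project/lifecycle"), a violation with that key at '/projects/project[7]/lifecycle[1]' would be dropped, while the same key in any other context, or any other key at the same location, would still get through. Keying the rules on the pair of error key and context is what would let us move constraints one context at a time.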
Is there a better way to achieve our goal?
Jan Dvorak
MathAn Praha