Hello all,

I'm facing the following problem, and I'd very much appreciate comments from 
people on this list. I apologize for the long post.

We have constructed an XML Schema and a Schematron schema that together 
constrain the data we accept into an information system. Technically 
speaking, it works great. We have an extensive suite of XML inputs and 
corresponding expected error reports and we use these to test that the 
checker does what it's supposed to do.

Where this fails is in the error reports from the XML Schema validation. In 
the Schematron reports we can speak in terms of the problem domain (data 
about scientific projects, their participants and the financial support 
thereof) - we write the messages ourselves. So we can e.g. report that a 
project should specify the date when it started. That's understandable to all 
our users and we can provide all useful diagnostics as to where the problem 
is located. However, if we place this constraint in the XML Schema, all we 
get is a cvc-something error report that says that the content of element 
'lifecycle' doesn't match its model. This is accompanied by line and column 
numbers. In this form, our users find it pretty much indigestible.

The first idea I had was to run away from XML Schema, to place all 
constraints in the Schematron schema. There might even be a way to 
automatically generate the Schematron constraints from an XML Schema, 
where we might be able to adjust the violation report texts. If we are sure 
all constraints from the XML Schema are moved to Schematron, we could skip 
the XML Schema validation step.

However, moving all the constraints to Schematron would increase the 
number of assertions from some 800 to some 6000 (est.) and that's a level of 
complexity neither we nor our customer can afford. We might also face 
performance problems.

The feasible way out seems to be gradually adding checks to the Schematron 
schema so that violations are reported there. We'll start with the most 
frequent ones, and continue with those where the error reports are especially 
cryptic. In the process, we would need to know at every moment that no error 
remains unreported. We might report an error twice, but then a simple 
correction - suppression of the report by the XML Schema validator - should take 
care of that. In the end, we might find that something like 30% of the 
constraints are moved to Schematron.

Now, we need to selectively suppress those XML Schema violations that will be 
reported by Schematron. We can't move the XML Schema constraint types one by 
one. It will always be a constraint type in a specific context (of an element 
type, or of an XML Schema type).

For that, we could use a common way of locating errors. I'm afraid that 
getting the physical locations from Schematron is too difficult a task and 
the result might not quite match the physical locations reported by Xerces. On the 
other hand, Schematron can reliably produce 'logical' locations, something 
like a 'canonical XPath' to the node where the violation occurred. E.g. 
'/root/a[1]/b[23]' meaning the 23rd 'b' child of the first 'a' child of 
'root'. (Things are more difficult in the presence of namespaces, but still 
tractable.)
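
To make the notion of a 'logical' location concrete, here is a minimal sketch 
of how such a path could be computed for a DOM element. The class and method 
names (CanonicalPath, of, parse) are made up for illustration. It deliberately 
ignores namespaces and simply uses the qualified name of each element, and it 
leaves the position predicate off the document element, as in the example 
above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CanonicalPath {

    /** Builds a path such as /root/a[1]/b[23] for an element node. */
    static String of(Element element) {
        StringBuilder path = new StringBuilder();
        // Walk upwards until we leave the element tree (the Document node).
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            String step = "/" + n.getNodeName();
            // The document element is unique, so only nested elements get [n].
            if (n.getParentNode() instanceof Element) {
                int position = 1;
                // Count preceding sibling elements with the same name.
                for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                    if (s instanceof Element && s.getNodeName().equals(n.getNodeName())) {
                        position++;
                    }
                }
                step += "[" + position + "]";
            }
            path.insert(0, step);
        }
        return path.toString();
    }

    /** Helper for the example: parses an XML string into a DOM document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<root><a><b/><b/></a><a><b/></a></root>");
        Element secondB = (Element) doc.getElementsByTagName("b").item(1);
        System.out.println(of(secondB)); // prints /root/a[1]/b[2]
    }
}
```

A namespace-aware variant would have to compare namespace URI and local name 
instead of the qualified name, and agree on a prefix convention with the 
Schematron side.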

How difficult would it be to extend Xerces to:
 (i) Produce 'logical' locations in terms of 'canonical' XPaths
     as described above.
 (ii) Pass these locations to XMLErrorReporter.
Then I could set up a filtering XMLErrorReporter that would let me gradually 
move violation reports from XML Schema to Schematron. 
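
For illustration only, here is a rough sketch of the kind of filter I have in 
mind, written against the plain SAX interfaces rather than Xerces internals. 
Everything named here (SchemaErrorFilter, PathTracker, FilteringErrorHandler, 
the "key@path" rule format) is hypothetical. It assumes that the validator's 
messages start with the constraint key followed by a colon (e.g. 
"cvc-complex-type.2.4.b: ..."), and that the tracker sees each element event 
before the validator does, for example by sitting as a SAX filter in front of 
a javax.xml.validation.ValidatorHandler. If the validator runs ahead of the 
tracker, the recorded path will lag behind the real one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.XMLFilterImpl;

public class SchemaErrorFilter {

    /** Tracks the logical path of the element currently being processed. */
    static class PathTracker extends XMLFilterImpl {
        private final Deque<String> steps = new ArrayDeque<>();
        private final Deque<Map<String, Integer>> siblingCounts = new ArrayDeque<>();

        PathTracker() {
            siblingCounts.push(new HashMap<>()); // counters for the document level
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            // Position of this element among same-named siblings seen so far.
            int position = siblingCounts.peek().merge(qName, 1, Integer::sum);
            steps.addLast(steps.isEmpty() ? "/" + qName : "/" + qName + "[" + position + "]");
            siblingCounts.push(new HashMap<>());
            super.startElement(uri, local, qName, atts); // downstream sees the new path
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            super.endElement(uri, local, qName); // errors raised here still belong to this element
            siblingCounts.pop();
            steps.removeLast();
        }

        String currentPath() {
            return String.join("", steps);
        }
    }

    /** Drops schema errors already covered by Schematron; keeps all others. */
    static class FilteringErrorHandler implements ErrorHandler {
        private final PathTracker tracker;
        // Hypothetical rule format: "<constraint key>@<path without positions>".
        private final Set<String> suppressed;
        final List<String> reported = new ArrayList<>();

        FilteringErrorHandler(PathTracker tracker, Set<String> suppressed) {
            this.tracker = tracker;
            this.suppressed = suppressed;
        }

        @Override public void warning(SAXParseException e) { report(e); }
        @Override public void error(SAXParseException e) { report(e); }
        @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }

        private void report(SAXParseException e) {
            String message = String.valueOf(e.getMessage());
            int colon = message.indexOf(':');
            String key = colon > 0 ? message.substring(0, colon) : "";
            String path = tracker.currentPath();
            // Rules apply per element context, so strip the [n] predicates.
            String context = path.replaceAll("\\[\\d+\\]", "");
            if (!suppressed.contains(key + "@" + context)) {
                reported.add(path + " " + message);
            }
        }
    }
}
```

With a rule such as "cvc-complex-type.2.4.b@/root/project/lifecycle" in the 
suppressed set, an incomplete 'lifecycle' would be left to Schematron, while 
every other schema violation would still be reported together with its 
logical path. Anything not explicitly listed passes through, which is what 
keeps errors from going unreported during the migration.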

Is there a better way to achieve our goal?


Jan Dvorak
MathAn Praha