Re: Schematron streaming?

Mike Beckerle Thu, 24 Aug 2023 08:24:36 -0700

Relaxing the min/maxOccurs seems problematic to me. Lots of things parse up
to a maximum by forward speculation, but stop when maxOccurs is reached.
(This is what dfdl:occursCountKind="implicit" does)


For optional elements (minOccurs 0, maxOccurs 1), this behavior is
particularly important.

When you say "occurrence constraint errors are not recoverable", I'm not
sure I understand what you mean. If something is minOccurs="1"
maxOccurs="1" i.e., a scalar element, then yes, not finding it is a parse
error. But for all other combinations of min/max occurs, the behavior
depends on dfdl:occursCountKind.
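
As a rough illustration of how the bounds interact with dfdl:occursCountKind="implicit", here is a minimal sketch of three declarations. The element names, the types, and the bound of 9 are hypothetical, and every other DFDL property is assumed to come from a default format that is not shown.

```xml
<xs:sequence xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <!-- Scalar: minOccurs and maxOccurs both default to 1, so failing to
       find this element is a parse error. -->
  <xs:element name="mandatoryField" type="xs:string"/>

  <!-- Optional: the parser speculatively attempts one occurrence and
       carries on without it if that attempt fails. -->
  <xs:element name="optionalField" type="xs:string"
              minOccurs="0" maxOccurs="1"
              dfdl:occursCountKind="implicit"/>

  <!-- Array: the parser speculates forward, but stops as soon as
       maxOccurs (here 9) occurrences have been parsed. -->
  <xs:element name="repeatingField" type="xs:string"
              minOccurs="1" maxOccurs="9"
              dfdl:occursCountKind="implicit"/>
</xs:sequence>
```

With maxOccurs="unbounded" on the last element, the parser would have no upper bound at which to stop, which is why relaxing the bounds can change where the following data ends up in the infoset.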

If you just put back the original min/max occurs, what exactly is happening
to make you think you need to relax those?

A dfdl:assert statement with failureType='recoverableError' generates a
warning (a.k.a. a validation error) and doesn't interact with parser
behavior (i.e., backtracking) at all.
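
A minimal sketch of such an assert, assuming a made-up rule that E1154 must be present whenever E1153 is 'VA'. The element names are taken from the example quoted below; the rule, the message text, and the simple types are assumptions for illustration only, and the remaining DFDL properties are assumed to come from a default format that is not shown.

```xml
<xs:element name="C506"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
            xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Hypothetical rule: a 'VA' qualifier requires a reference number.
           A failure is reported as a diagnostic; the parse continues and
           no backtracking is triggered. -->
      <dfdl:assert failureType="recoverableError"
                   message="E1154 is required when E1153 is 'VA'"
                   test="{ if (./E1153 eq 'VA') then fn:exists(./E1154) else fn:true() }"/>
    </xs:appinfo>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:element name="E1153" type="xs:string"/>
      <!-- Left optional so the parser accepts the data and the assert,
           not the occurrence bound, reports the problem. -->
      <xs:element name="E1154" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```

Changing failureType to "processingError" would turn the same check into a parse error that can cause backtracking.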

Are you using these 'recoverableError' asserts for your enhanced validation
rules?


On Sat, Aug 19, 2023 at 7:17 AM Claude Mamo <claude.m...@gmail.com> wrote:

> As suggested, I'm attempting to validate the structure of the EDIFACT
> document with DFDL assertions instead of Schematron. One thing I observed
> is that I need to relax *maxOccurs* (i.e., unbounded) and *minOccurs*
> (i.e., 0) otherwise the assertion rules won't be evaluated since occurrence
> constraint errors are not recoverable (I imagine that it would be the same
> case for Schematron). However, when I relax the constraints, the parsed
> structure changes from:
>
>                 <SegGrp-3>
>                     <RFF-18660>
>                         <C506>
>                             <E1153>VA</E1153>
>                             <E1154>UK19430839</E1154>
>                         </C506>
>                     </RFF-18660>
>                     <RFF-18660>
>                         <C506>
>                             <E1153>ADE</E1153>
>                             <E1154>00000767</E1154>
>                         </C506>
>                     </RFF-18660>
>                 </SegGrp-3>
>
> to
>
>                 <SegGrp-3>
>                     <RFF>
>                         <C506>
>                             <E1153>VA</E1153>
>                             <E1154>UK19430839</E1154>
>                         </C506>
>                     </RFF>
>                 </SegGrp-3>
>                 <SegGrp-3>
>                     <RFF>
>                         <C506>
>                             <E1153>ADE</E1153>
>                             <E1154>00000767</E1154>
>                         </C506>
>                     </RFF>
>                 </SegGrp-3>
>
> I'd rather avoid making breaking changes to the structure so I decided to
> have two flavours of EDIFACT messages: strict and lax. A choice element
> first attempts to parse the message using the strict schema and then falls
> back to the lax schema if parsing on the strict one fails.
>
>     ...
>     ...
>     <xsd:sequence dfdl:choiceBranchKey="INVOIC">
>        <xsd:choice>
>          <xsd:sequence>
>            <xsd:element ref="D03B:INVOIC"/>
>          </xsd:sequence>
>          <xsd:sequence>
>            <xsd:element ref="D03B:Bad-INVOIC"/>
>          </xsd:sequence>
>        </xsd:choice>
>     </xsd:sequence>
>     ...
>     ...
>
> The recoverable assertions are all defined within the *Bad-INVOIC* type
> and, where possible, the occurrence constraints are relaxed within this
> element type. Does what I wrote make sense, or do you think there might
> be a better way to implement this?
>
> Claude
>
> On Sun, Aug 13, 2023 at 12:31 PM Claude Mamo <claude.m...@gmail.com>
> wrote:
>
>>> Schematron is really only needed for really rich validation rules that
>>> use the tree-walking capabilities of XPath to scrutinize elements wherever
>>> they appear in the infoset tree.
>>>
>>
>> I'll give it a try with dfdl:assert and see how it goes.
>>
>> Thanks for all the feedback!
>>
>> Claude
>>
>> On Mon, Jul 24, 2023 at 11:35 PM Mike Beckerle <mbecke...@apache.org>
>> wrote:
>>
>>> Something to consider:
>>>
>>> I think many useful validation checks can be expressed in DFDL's
>>> expression language using the dfdl:assert statement with
>>> failureType='recoverableError'.
>>>
>>> The sort of constraints that say if this element exists then that can't
>>> exist, or if this has a specific value that that must exist... those sorts
>>> of things can usually be expressed.
>>>
>>> Those are run in an incremental/streaming fashion as the parser
>>> traverses the data based on the schema.
>>>
>>> Recoverable errors from Daffodil are the same as validation errors from
>>> Daffodil's internal "limited" evaluation. They don't guide the parse (don't
>>> cause backtracking), but come out as diagnostic warnings.
>>>
>>> Schematron is really only needed for really rich validation rules that
>>> use the tree-walking capabilities of XPath to scrutinize elements wherever
>>> they appear in the infoset tree.
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jul 24, 2023 at 7:47 AM Steve Lawrence <slawre...@apache.org>
>>> wrote:
>>>
>>>> This is correct. The way Daffodil currently implements full validation
>>>> (Xerces) and custom validation (e.g. Schematron) is pretty inefficient.
>>>> We create two infosets: one of the kind that the user passed to the parse
>>>> function, and one that is text XML written to a ByteArrayOutputStream in
>>>> memory that is used internally for the validation once the parse is
>>>> completed. We do not currently stream validation.
>>>>
>>>> If you wanted streaming, you would probably need to create a custom
>>>> InfosetOutputter, or maybe use the SAXInfosetOutputter with an XMLReader
>>>> that chains/tees SAX events to custom Schematron validation.
>>>>
>>>> - Steve
>>>>
>>>> On 2023-07-22 03:29 AM, Claude Mamo wrote:
>>>> > Spotted this code so presumably it's not streaming when custom or full
>>>> > validation is in force:
>>>> >
>>>> > https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/DataProcessor.scala#L345-L356
>>>> >
>>>> > Claude
>>>> >
>>>> > On Sat, Jul 22, 2023 at 8:07 AM Claude Mamo <claude.m...@gmail.com> wrote:
>>>> >
>>>> >     Hello Daffodil team,
>>>> >
>>>> >     I'm looking into adding support for Schematron validation since we
>>>> >     have had many Smooks developers asking for better validation of
>>>> >     EDIFACT documents. One question I have is whether Schematron
>>>> >     validation is applied in a streaming fashion. I mean, does Daffodil
>>>> >     load the whole infoset into memory before applying the Schematron
>>>> >     rules or is Schematron validating on the fly while accumulating any
>>>> >     state that is required to be able to evaluate the rules?
>>>> >
>>>> >     Thanks,
>>>> >
>>>> >     Claude
>>>> >
>>>>
>>>>
