Re: Schematron streaming?

Claude Mamo Sat, 26 Aug 2023 21:08:52 -0700

>
> When you say "occurrence constraint errors are not recoverable", I'm not
> sure I understand what you mean. If something is minOccurs="1"
> maxOccurs="1" i.e., a scalar element, then yes, not finding it is a parse
> error. But for all other combinations of min/max occurs, the behavior
> depends on dfdl:occursCountKind.
>


I had a misconception about how min/maxOccurs behave in DFDL. The
occursCountKind attribute is new to me, but now I've realised that I can
ditch this strict vs lax schema approach. A lot of my problems can be solved
simply by changing occursCountKind from "implicit" to "parsed" for the
EDIFACT segments (the DFDL schema was based on
https://github.com/DFDLSchemas/EDIFACT).
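
For illustration, here is a minimal sketch of the kind of change I mean. It is
not taken from the EDIFACT schema: the element content, the maxOccurs value
and the assert message are hypothetical, and the remaining DFDL format
properties (separators, encoding and so on) are assumed to be in scope from
an included format definition.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
            xmlns:fn="http://www.w3.org/2005/xpath-functions">

  <!-- Hypothetical segment group, simplified for illustration. -->
  <xsd:element name="SegGrp-3">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:annotation>
          <xsd:appinfo source="http://www.ogf.org/dfdl/">
            <!-- Assumption: report an occurrence problem as a recoverable
                 error (a diagnostic) instead of a parse error. -->
            <dfdl:assert failureType="recoverableError"
                         message="SegGrp-3 allows at most 9 RFF segments"
                         test="{ fn:count(./RFF) le 9 }"/>
          </xsd:appinfo>
        </xsd:annotation>
        <!-- With occursCountKind="parsed" the number of occurrences is
             found by speculative parsing; minOccurs/maxOccurs are only
             consulted by validation. With "implicit" they also bound
             the parse. -->
        <xsd:element name="RFF" type="xsd:string"
                     minOccurs="1" maxOccurs="9"
                     dfdl:occursCountKind="parsed"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

Since the occurrence bounds no longer stop the parse, the original element
names and nesting should be able to stay as they are, which is what I was
after with the strict/lax split.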

Cheers!

Claude

On Thu, Aug 24, 2023 at 5:24 PM Mike Beckerle <mbecke...@apache.org> wrote:

> Relaxing the min/maxOccurs seems problematic to me. Lots of things parse
> up to a maximum by forward speculation, but stop when maxOccurs is reached.
> (This is what dfdl:occursCountKind="implicit" does)
>
> For optional elements (minOccurs 0, maxOccurs 1), this behavior is
> particularly important.
>
> When you say "occurrence constraint errors are not recoverable", I'm not
> sure I understand what you mean. If something is minOccurs="1"
> maxOccurs="1" i.e., a scalar element, then yes, not finding it is a parse
> error. But for all other combinations of min/max occurs, the behavior
> depends on dfdl:occursCountKind.
>
> If you just put back the original min/max occurs, what exactly is
> happening to make you think you need to relax those?
>
> A dfdl:assert statement of kind 'recoverableError' generates a warning aka
> validation error, and doesn't interact with parser-behavior (i.e.,
> backtracking) at all.
>
> Are you using these 'recoverableError' asserts for your enhanced
> validation rules?
>
>
> On Sat, Aug 19, 2023 at 7:17 AM Claude Mamo <claude.m...@gmail.com> wrote:
>
>> As suggested, I'm attempting to validate the structure of the EDIFACT
>> document with DFDL assertions instead of Schematron. One thing I observed
>> is that I need to relax *maxOccurs* (i.e., unbounded) and *minOccurs*
>> (i.e., 0), otherwise the assertion rules won't be evaluated since occurrence
>> constraint errors are not recoverable (I imagine that it would be the same
>> case for Schematron). However, when I relax the constraints, the parsed
>> structure changes from:
>>
>>                   <SegGrp-3>
>>                     <RFF-18660>
>>                         <C506>
>>                             <E1153>VA</E1153>
>>                             <E1154>UK19430839</E1154>
>>                         </C506>
>>                     </RFF-18660>
>>                     <RFF-18660>
>>                         <C506>
>>                             <E1153>ADE</E1153>
>>                             <E1154>00000767</E1154>
>>                         </C506>
>>                     </RFF-18660>
>>                 </SegGrp-3>
>>
>> to
>>
>>                 <SegGrp-3>
>>                     <RFF>
>>                         <C506>
>>                             <E1153>VA</E1153>
>>                             <E1154>UK19430839</E1154>
>>                         </C506>
>>                     </RFF>
>>                 </SegGrp-3>
>>                 <SegGrp-3>
>>                     <RFF>
>>                         <C506>
>>                             <E1153>ADE</E1153>
>>                             <E1154>00000767</E1154>
>>                         </C506>
>>                     </RFF>
>>                 </SegGrp-3>
>>
>> I'd rather avoid making breaking changes to the structure so I decided to
>> have two flavours of EDIFACT messages: strict and lax. A choice element
>> first attempts to parse the message using the strict schema and then falls
>> back to the lax schema if parsing on the strict one fails.
>>
>>     ...
>>     ...
>>     <xsd:sequence dfdl:choiceBranchKey="INVOIC">
>>        <xsd:choice>
>>          <xsd:sequence>
>>            <xsd:element ref="D03B:INVOIC"/>
>>          </xsd:sequence>
>>          <xsd:sequence>
>>            <xsd:element ref="D03B:Bad-INVOIC"/>
>>          </xsd:sequence>
>>         </xsd:choice>
>>     </xsd:sequence>
>>     ...
>>     ...
>>
>> The recoverable assertions are all defined within the *Bad-INVOIC* type
>> and, where possible, the occurrence constraints are relaxed within this
>> element type. Does what I wrote make sense, or do you think there might
>> be a better way to implement this?
>>
>> Claude
>>
>> On Sun, Aug 13, 2023 at 12:31 PM Claude Mamo <claude.m...@gmail.com>
>> wrote:
>>
>>>> Schematron is really only needed for really rich validation rules that
>>>> use the tree-walking capabilities of XPath to scrutinize elements wherever
>>>> they appear in the infoset tree.
>>>>
>>>
>>> I'll give it a try with dfdl:assert and see how it goes.
>>>
>>> Thanks for all the feedback!
>>>
>>> Claude
>>>
>>> On Mon, Jul 24, 2023 at 11:35 PM Mike Beckerle <mbecke...@apache.org>
>>> wrote:
>>>
>>>> Something to consider:
>>>>
>>>> I think many useful validation checks can be expressed in DFDL's
>>>> expression language using the dfdl:assert statement with
>>>> failureType='recoverableError'.
>>>>
>>>> The sort of constraints that say if this element exists then that can't
>>>> exist, or if this has a specific value then that must exist... those sorts
>>>> of things can usually be expressed.
>>>>
>>>> Those are run in an incremental/streaming fashion as the parser
>>>> traverses the data based on the schema.
>>>>
>>>> Recoverable errors from Daffodil are the same as validation errors from
>>>> Daffodil's internal "limited" evaluation. They don't guide the parse (don't
>>>> cause backtracking), but come out as diagnostic warnings.
>>>>
>>>> Schematron is really only needed for really rich validation rules that
>>>> use the tree-walking capabilities of XPath to scrutinize elements wherever
>>>> they appear in the infoset tree.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jul 24, 2023 at 7:47 AM Steve Lawrence <slawre...@apache.org>
>>>> wrote:
>>>>
>>>>> This is correct. The way daffodil currently implements full validation
>>>>> (xerces) and custom validation (e.g. schematron) is pretty inefficient.
>>>>> We create two infosets: one of the kind that the user passed to the parse
>>>>> function, and one that is text XML written to a ByteArrayOutputStream in
>>>>> memory that is used internally for the validation once the parse is
>>>>> completed. We do not currently stream validation.
>>>>>
>>>>> If you wanted streaming, you would probably need to create a custom
>>>>> InfosetOutputter, or maybe use the SAXInfosetOutputter with an XMLReader
>>>>> that chains/tees SAX events to custom schematron validation.
>>>>>
>>>>> - Steve
>>>>>
>>>>> On 2023-07-22 03:29 AM, Claude Mamo wrote:
>>>>> > Spotted this code so presumably it's not streaming when custom or full
>>>>> > validation is in force:
>>>>> >
>>>>> > https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/DataProcessor.scala#L345-L356
>>>>> >
>>>>> > Claude
>>>>> >
>>>>> > On Sat, Jul 22, 2023 at 8:07 AM Claude Mamo <claude.m...@gmail.com> wrote:
>>>>> >
>>>>> >     Hello Daffodil team,
>>>>> >
>>>>> >     I'm looking into adding support for Schematron validation since we
>>>>> >     have had many Smooks developers asking for better validation of
>>>>> >     EDIFACT documents. One question I have is whether Schematron
>>>>> >     validation is applied in a streaming fashion. I mean, does Daffodil
>>>>> >     load the whole infoset into memory before applying the Schematron
>>>>> >     rules or is Schematron validating on the fly while accumulating any
>>>>> >     state that is required to be able to evaluate the rules?
>>>>> >
>>>>> >     Thanks,
>>>>> >
>>>>> >     Claude
>>>>> >
>>>>>
>>>>>
