Re: Schematron streaming?

Claude Mamo Sat, 19 Aug 2023 04:18:07 -0700

As suggested, I'm attempting to validate the structure of the EDIFACT
document with DFDL assertions instead of Schematron. One thing I observed
is that I need to relax *maxOccurs* (i.e., unbounded) and *minOccurs*
(i.e., 0); otherwise the assertion rules won't be evaluated, since occurrence
constraint errors are not recoverable (I imagine that it would be the same
case for Schematron). However, when I relax the constraints, the parsed
structure changes from:


                <SegGrp-3>
                    <RFF-18660>
                        <C506>
                            <E1153>VA</E1153>
                            <E1154>UK19430839</E1154>
                        </C506>
                    </RFF-18660>
                    <RFF-18660>
                        <C506>
                            <E1153>ADE</E1153>
                            <E1154>00000767</E1154>
                        </C506>
                    </RFF-18660>
                </SegGrp-3>

to

                <SegGrp-3>
                    <RFF>
                        <C506>
                            <E1153>VA</E1153>
                            <E1154>UK19430839</E1154>
                        </C506>
                    </RFF>
                </SegGrp-3>
                <SegGrp-3>
                    <RFF>
                        <C506>
                            <E1153>ADE</E1153>
                            <E1154>00000767</E1154>
                        </C506>
                    </RFF>
                </SegGrp-3>

I'd rather avoid making breaking changes to the structure so I decided to
have two flavours of EDIFACT messages: strict and lax. A choice element
first attempts to parse the message using the strict schema and then falls
back to the lax schema if parsing on the strict one fails.

    ...
    ...
    <xsd:sequence dfdl:choiceBranchKey="INVOIC">
       <xsd:choice>
         <xsd:sequence>
           <xsd:element ref="D03B:INVOIC"/>
         </xsd:sequence>
         <xsd:sequence>
           <xsd:element ref="D03B:Bad-INVOIC"/>
         </xsd:sequence>
       </xsd:choice>
    </xsd:sequence>
    ...
    ...

The recoverable assertions are all defined within the *Bad-INVOIC* type
and, where possible, the occurrence constraints are relaxed within this
element type. Does what I wrote make sense, or do you think there might
be a better way to implement this?
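
For concreteness, here is a minimal sketch of what one relaxed group inside
*Bad-INVOIC* might look like. It is a fragment, not a complete schema: the
type name RFFType, the limits of 1 and 99 and the message text are made up
for illustration, and it assumes the xsd, dfdl and fn prefixes are bound and
that dfdl:occursCountKind is set elsewhere in the schema. Only dfdl:assert
with failureType="recoverableError" comes from Mike's suggestion.

```xml
<xsd:sequence>
  <xsd:annotation>
    <xsd:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Hypothetical rule: restates the occurrence limits that were
           removed from minOccurs/maxOccurs as a recoverable check. -->
      <dfdl:assert failureType="recoverableError"
                   message="RFF must occur between 1 and 99 times"
                   test="{ fn:count(RFF) ge 1 and fn:count(RFF) le 99 }"/>
    </xsd:appinfo>
  </xsd:annotation>
  <!-- Relaxed occurrence constraints so that the parse itself cannot
       fail on the number of RFF segments. RFFType is a placeholder. -->
  <xsd:element name="RFF" type="RFFType"
               minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
```

The idea is that the limits dropped from the occurrence constraints reappear
as an expression, so a violation is reported as a diagnostic instead of
failing the parse. As far as I understand, an expression assert on a sequence
is evaluated after the sequence content has been parsed, which is what makes
the count available, but that is worth double-checking.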

Claude

On Sun, Aug 13, 2023 at 12:31 PM Claude Mamo <claude.m...@gmail.com> wrote:

>> Schematron is really only needed for really rich validation rules that use
>> the tree-walking capabilities of XPath to scrutinize elements wherever they
>> appear in the infoset tree.
>>
>
> I'll give it a try with dfdl:assert and see how it goes.
>
> Thanks for all the feedback!
>
> Claude
>
> On Mon, Jul 24, 2023 at 11:35 PM Mike Beckerle <mbecke...@apache.org>
> wrote:
>
>> Something to consider:
>>
>> I think many useful validation checks can be expressed in DFDL's
>> expression language using the dfdl:assert statement with
>> failureType='recoverableError'.
>>
>> The sort of constraints that say if this element exists then that can't
>> exist, or if this has a specific value then that must exist... those sorts
>> of things can usually be expressed.
>>
>> Those are run in an incremental/streaming fashion as the parser traverses
>> the data based on the schema.
>>
>> Recoverable errors from Daffodil are the same as validation errors from
>> Daffodil's internal "limited" validation. They don't guide the parse (don't
>> cause backtracking), but come out as diagnostic warnings.
>>
>> Schematron is really only needed for really rich validation rules that
>> use the tree-walking capabilities of XPath to scrutinize elements wherever
>> they appear in the infoset tree.
>>
>>
>>
>>
>>
>> On Mon, Jul 24, 2023 at 7:47 AM Steve Lawrence <slawre...@apache.org>
>> wrote:
>>
>>> This is correct. The way daffodil currently implements full validation
>>> (xerces) and custom validation (e.g. schematron) is pretty inefficient.
>>> We create two infosets: one of the kind that the user passed to the parse
>>> function, and one that is text XML written to a ByteArrayOutputStream in
>>> memory that is used internally for the validation once the parse is
>>> completed. We do not currently stream validation.
>>>
>>> If you wanted streaming, you would probably need to create a custom
>>> InfosetOutputter, or maybe use the SAXInfosetOutputter with an XMLReader
>>> that chains/tees SAX events to custom schematron validation.
>>>
>>> - Steve
>>>
>>> On 2023-07-22 03:29 AM, Claude Mamo wrote:
>>> > Spotted this code so presumably it's not streaming when custom or full
>>> > validation is in force:
>>> >
>>> https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/DataProcessor.scala#L345-L356
>>> >
>>> > Claude
>>> >
>>> > On Sat, Jul 22, 2023 at 8:07 AM Claude Mamo <claude.m...@gmail.com> wrote:
>>> >
>>> >     Hello Daffodil team,
>>> >
>>> >     I'm looking into adding support for Schematron validation since we
>>> >     have had many Smooks developers asking for better validation of
>>> >     EDIFACT documents. One question I have is whether Schematron
>>> >     validation is applied in a streaming fashion. I mean, does Daffodil
>>> >     load the whole infoset into memory before applying the Schematron
>>> >     rules or is Schematron validating on the fly while accumulating any
>>> >     state that is required to be able to evaluate the rules?
>>> >
>>> >     Thanks,
>>> >
>>> >     Claude
>>> >
>>>
>>>
