This example gives me an opportunity to motivate a point that applies to
all the recently posted examples, which is that validating with -V limited
is not enough.

We need the facet validation for strings to affect parsing decisions. Doing
so affects the composition properties of the element definitions.

In tiny tests of individual values you cannot see the difference. But in
examples where elements are composed together into sequences and choices
you can see why this is needed.

Let's suppose we have data which contains two fields: an SSN and a
MassRealID number, in either order (SSN then MassRealID, or MassRealID
then SSN).

Similar to the SSN element definition in the thread, we define an element
"MassRealID":

<element name="MassRealID"
      dfdl:lengthKind='delimited'>
    <simpleType>
       <restriction base="xs:string">
          <!-- one alpha char followed by 8 to 10 digits -->
          <pattern value="[A-Z]\d{8,10}"/>
       </restriction>
    </simpleType>
</element>

Now the syntax for MassRealID is disjoint from SSNs because the first
character must be an alpha char. So we can positively tell apart a valid
SSN and a valid MassRealID.

That means we should be able to do this to model our id fields that can be
in either order:

<element name="id_fields">
  <complexType>
    <choice>
       <sequence dfdl:separator="|">
           <element ref="SSN"/>  <element ref="MassRealID"/>
       </sequence>
       <sequence dfdl:separator="|">
           <element ref="MassRealID"/>  <element ref="SSN"/>
       </sequence>
    </choice>
  </complexType>
</element>

This seems reasonable. Certainly many element definitions will compose this
way.  This kind of compositionality is important in format descriptions.

So, now let's parse some data. The data looks like "S2345678901|123-45-6789"

Parsing this with Daffodil will succeed producing this infoset:

<id_fields>
  <SSN>S2345678901</SSN>
  <MassRealID>123-45-6789</MassRealID>
</id_fields>

Well, that's wrong. The SSN element is holding data with MassRealID syntax,
and the MassRealID element is holding data with SSN syntax. The whole point
of our format was NOT to do this.

Even if you have validation enabled via daffodil -V limited, or -V on, you
will still get that exact same infoset; you will just ALSO get these two
validation errors:

Validation Error: SSN doesn't match facet pattern ....
Validation Error: MassRealID doesn't match facet pattern ...

So adding -V limited is simply not enough. We got a wrong infoset and two
validation errors rather than the correct infoset (and no validation
errors).

We want the DFDL format specification to recognize the different syntax of
SSNs and MassRealIDs and tell them apart when parsing.

What went wrong is that the schema failed to distinguish "well-formed" from
"valid". The pattern facets are the only thing that distinguishes SSN data
from MassRealID data, but the pattern facets aren't used when parsing.
They're only used to validate the result of parsing.

We want that data example to produce this infoset and with validation
turned on it should produce no validation errors:

<id_fields>
    <MassRealID>S2345678901</MassRealID>
    <SSN>123-45-6789</SSN>
</id_fields>

In order for DFDL to express this, one must declare, in the DFDL schema,
that the XSD facets must be used to determine when data is well-formed, and
not just as a post-parse validation check.

The way this is expressed in DFDL is with a dfdl:assert statement which
tells the DFDL processor explicitly to check the facet constraints when
parsing. A facet-check failure makes the assertion fail, and an assertion
failure is a parse error.

This dfdl:assert statement, with surrounding annotation/appinfo
boilerplate, is what is needed. For both the SSN and the MassRealID we need
this assertion:

<annotation><appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
</appinfo></annotation>
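
As a sketch of where that boilerplate goes, here is the SSN element from
the thread with the assertion placed directly on the element. The
annotation comes first, before the simpleType. The unprefixed element
names follow the style of this message (the thread itself uses the xs:
prefix), and the sketch assumes the dfdl prefix is bound to the DFDL
namespace in the enclosing schema:

<element name="SSN"
      dfdl:lengthKind='explicit' dfdl:length='11'>
    <annotation><appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
    </appinfo></annotation>
    <simpleType>
       <restriction base="xs:string">
          <pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
       </restriction>
    </simpleType>
</element>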

One can package that to make it convenient to use:

<simpleType name="validString">
  <annotation><appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
  </appinfo></annotation>
  <restriction base="xs:string"/>
</simpleType>

Then change all the type references like type="xs:string" or
base="xs:string" to type="pre:validString" or base="pre:validString" using
the right namespace prefix, or just type="validString" or
base="validString" if there is no namespace.

Now you will have the MassRealID and SSN elements as restrictions of
validString, which asserts, as part of parsing, that their facets must be
satisfied.
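
For illustration, the two element definitions might then look like the
sketch below. It assumes the no-namespace case, so validString is
referenced without a prefix; the properties and patterns are the ones
already shown in this thread:

<element name="SSN"
      dfdl:lengthKind='explicit' dfdl:length='11'>
    <simpleType>
       <restriction base="validString">
          <pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
       </restriction>
    </simpleType>
</element>

<element name="MassRealID"
      dfdl:lengthKind='delimited'>
    <simpleType>
       <restriction base="validString">
          <pattern value="[A-Z]\d{8,10}"/>
       </restriction>
    </simpleType>
</element>

The id_fields choice itself does not need to change; only the types of
the two referenced elements do.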

Now the parse of "S2345678901|123-45-6789" will first try to parse
"S2345678901" as an SSN, and that will FAIL due to the assertion check.
The choice will backtrack to the other alternative and parse
"S2345678901" as a MassRealID, which will succeed, and then parse
"123-45-6789" as an SSN. Ultimately we get the parse output we expect,
with no validation errors.

Now one might ask: Why isn't this the default behavior for DFDL?
The answer is uniformity of behavior across all types.
This challenge really only arises for type xs:string. If you use
numbers or dates or booleans, then the data must be convertible into that
type, and that is most commonly the basis for the concept of "well-formed".
It's only for structured string data, where being well-formed means more
than just being an "xs:string", that this facet checking becomes
important to the DFDL parse behavior.
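
To illustrate the contrast, consider a hypothetical numeric field (the
name "Age" is made up for this sketch and is not from the thread; it also
assumes a text representation with the usual text number properties in
scope):

<element name="Age" type="xs:int"
      dfdl:lengthKind='delimited'/>

Given the data "abc", the text cannot be converted to an xs:int, so the
parse of Age fails without any dfdl:assert, and an enclosing choice would
backtrack on its own.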


-mike beckerle





On Tue, Sep 27, 2022 at 9:17 AM Roger L Costello <coste...@mitre.org> wrote:

> Hi Folks,
>
>
>
> Please let me know of anything that is unclear.  /Roger
>
>
> --------------------------------------------------------------------------------------
>
> 5. Fixed length, not nillable, not composite, no choice
>
>
>
> This is an easy one.
>
> We will create a DFDL schema for a field containing a Social Security
> Number. I named the field “SSN.”
>
> Here is a sample value:
>
> 123-45-6789
>
> The field has a fixed length of 11.
>
> Field Requirements:
>
> >>  Fixed length (11)
>
> >>  Not nillable
>
> >>  No choice
>
> >>  Not composite, i.e., single atomic value
>
>
>
> Add to the field these two DFDL properties:
>
> dfdl:lengthKind="explicit"
> dfdl:length="__"
>
> Here’s the DFDL schema with the DFDL properties added (shown in yellow):
>
> <xs:element name="SSN"
>                        dfdl:lengthKind="explicit"
>                        dfdl:length="11">
>     <xs:simpleType>
>         <xs:restriction base="xs:string">
>             <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
>         </xs:restriction>
>     </xs:simpleType>
> </xs:element>
>
> That’s it!
>
> One last (important) point: When parsing input with Daffodil use the -V
> limited option. The option instructs Daffodil to validate the field
> against the XSD pattern facet. With this erroneous input value:
>
> xxx-45-6789
>
>
>
> Daffodil gives this very helpful error message on parsing:
>
>
>
> [error] Validation Error: SSN failed facet checks due to: facet
> pattern(s): [0-9]{3}-[0-9]{2}-[0-9]{4}
>
>
>
> If you don’t use the -V limited option, then Daffodil won’t validate the
> parts against the XSD facets. Consequently, Daffodil will not report any
> errors with the above erroneous input. Why? Because if we ignore the facets
> in this element declaration:
>
> <xs:element name="SSN"
>                        dfdl:lengthKind="explicit"
>                        dfdl:length="11">
>     <xs:simpleType>
>         <xs:restriction base="xs:string">
>             <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
>         </xs:restriction>
>     </xs:simpleType>
> </xs:element>
>
> then it is simply saying that the input is any text of length 11, and
> “xxx-45-6789” certainly fits that specification.
>
