This is correct, but your second overall lesson feels a bit strong. It could be read as saying that a schema accepting well-formed but invalid data can *never* produce correct XML. I might reword it slightly to something like:

> In some cases data validity must be tested while parsing (e.g. via checkConstraints) in order to discriminate between alternatives and get the correct XML.

The issue with this particular example is that there is ambiguity in the data, and the only way to discriminate which element to parse is to actually inspect/validate the data.

As a counterexample, imagine the data format looked like this:

  SSN:123-45-6789 REALID:A12345678

Then our choice can look like this:

  <choice>
    <element ref="RealID" dfdl:initiator="REALID:" />
    <element ref="SSN" dfdl:initiator="SSN:" />
  </choice>

The RealID and SSN elements now do not need the checkConstraints assertion because the initiator discriminates which element to parse. And now it is possible to have invalid data while still being considered well-formed. For example, this would parse successfully but would be invalid:

  REALID:123-45-6789 SSN:A12345678

Another alternative could be to use a pattern assert to "guess" which choice branch to take. For example, we know an SSN must start with a digit and a RealID must start with a letter. So we could do something like this:

  <element name="SSN" dfdl:lengthKind="explicit" dfdl:length="11">
    <annotation>
      <appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert testKind="pattern" testPattern="[0-9]" />
      </appinfo>
    </annotation>
    <simpleType>
      <restriction base="xs:string">
        <pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
      </restriction>
    </simpleType>
  </element>

RealID would look similar but have testPattern="[A-Z]".
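
Spelled out, it might look something like the sketch below. I'm assuming a RealID is always one uppercase letter followed by eight digits, like the A12345678 sample, so the length of 9 and the pattern facet are guesses from that sample rather than anything taken from your schema:

```xml
<element name="RealID" dfdl:lengthKind="explicit" dfdl:length="9">
  <annotation>
    <appinfo source="http://www.ogf.org/dfdl/">
      <!-- Tested against the data at the start of RealID:
           the first character must be an uppercase letter -->
      <dfdl:assert testKind="pattern" testPattern="[A-Z]" />
    </appinfo>
  </annotation>
  <simpleType>
    <restriction base="xs:string">
      <!-- Assumed format (one letter, eight digits);
           only checked when validation is enabled -->
      <pattern value="[A-Z][0-9]{8}"/>
    </restriction>
  </simpleType>
</element>
```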

This way the appropriate choice branch is selected based on whether the first character is a letter or a digit. Now it's possible to have well-formed but invalid data, as well as get the expected XML. For example, the following would parse successfully with the right XML, but would cause validation errors for both SSN and RealID:

  123-XX-6789 A123456XX
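
To make the branch selection concrete, here is a rough sketch of the two-branch choice from your message. The element names come from your XML output, and I'm assuming SSN, RealID, and Space are global elements declared elsewhere in the schema:

```xml
<element name="PersonID">
  <complexType>
    <choice>
      <!-- Branch 1: RealID first. Its [A-Z] assert fails
           when the data starts with a digit. -->
      <sequence>
        <element ref="RealID"/>
        <element ref="Space"/>
        <element ref="SSN"/>
      </sequence>
      <!-- Branch 2: SSN first. Tried only after branch 1 fails. -->
      <sequence>
        <element ref="SSN"/>
        <element ref="Space"/>
        <element ref="RealID"/>
      </sequence>
    </choice>
  </complexType>
</element>
```

For the input above, the RealID assert in the first branch fails on the leading 1, so the parser backtracks and takes the second branch, where both asserts pass. The pattern facets are only checked afterwards, by validation.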

Note however, that if you had this:

  A23-45-6789

then this approach would treat it as a RealID because it starts with a letter, even though it looks almost like an SSN. You would end up with a validation error saying it isn't a valid RealID, when in practice you might have preferred an error saying it's not a valid SSN.

Also note that with this approach, if the data was

  xxx-45-6789

then this would be considered not well-formed because it doesn't start with a digit or an uppercase letter.


On 10/7/22 9:09 AM, Roger L Costello wrote:
Hi Folks,

My input contains a social security number (SSN), e.g.,

123-45-6789

If I declare the SSN element like this:

<xs:element name="SSN"
                       dfdl:lengthKind="explicit"
                      dfdl:length="11">
     <xs:simpleType>
         <xs:restriction base="xs:string">
             <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
         </xs:restriction>
     </xs:simpleType>
</xs:element>

then the parser will accept well-formed but invalid data such as this:

xxx-45-6789

If I want to be notified that the data is not valid, then I can use the -V 
limited option. Then the parser will both generate XML and notify me that the 
input is not valid.

If I add checkConstraints:

<xs:element name="SSN"
                        dfdl:lengthKind="explicit"
                        dfdl:length="11">
     <xs:annotation>
         <xs:appinfo source="http://www.ogf.org/dfdl/">
             <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
         </xs:appinfo>
     </xs:annotation>
     <xs:simpleType>
         <xs:restriction base="xs:string">
             <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
         </xs:restriction>
     </xs:simpleType>
</xs:element>

then the parser no longer accepts well-formed but invalid data. No XML is 
generated.

Lesson Learned: Don't use checkConstraints if you want parsing to accept 
well-formed but invalid input.

But, but, but, ........

Things aren't that simple.

Suppose SSN is part of a choice. The choice has two branches. The first branch 
specifies RealID space SSN, the second branch specifies SSN space RealID.

Consider this valid input:

123-45-6789 A12345678

If the DFDL does not use checkConstraints, then this incorrect XML is generated:

   <PersonID>
     <RealID>123-45-6789</RealID>
     <Space> </Space>
     <SSN>A12345678</SSN>
   </PersonID>

Notice that the <RealID> value is the ssn and the <SSN> value is the real id.

If we want to get correct XML, then we must use checkConstraints.

Lesson Learned: Use checkConstraints if you want parsing to generate correct 
XML.

Overall Lesson Learned: You can't have a DFDL schema that both accepts 
well-formed but invalid data and always produces correct XML.

Do you agree?

/Roger
