You can improve the schema so that it clearly rejects malformed data,
rather than only accepting correct data.

Consider:

"nodatSTnodataW"

I think the above will parse *successfully* and then give a bunch of
validation errors about the data. Pretty sure that's not your intent. You
want this to fail at parse time. Your facet patterns aren't just about
validating the data. Those patterns are really about well-formedness of the
data. They are the only place that requires the numeric strings to even be
digits, for example.
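
To make that concrete, here is a rough hand trace of how I would expect the
Location schema quoted below to carve up that string. It assumes the nil
handling from section 2 does not interfere and that LongitudeHemisphere
picks up whatever remains before the field delimiter. The infoset is my own
illustration, not output from an actual run:

```xml
<!-- Hand-traced illustration for the input "nodatSTnodataW" -->
<Location>
    <LatitudeDegrees>no</LatitudeDegrees>        <!-- explicit length 2 -->
    <LatitudeMinutes>dat</LatitudeMinutes>       <!-- lazy match up to the first N or S -->
    <LatitudeHemisphere>S</LatitudeHemisphere>   <!-- explicit length 1 -->
    <Hyphen>T</Hyphen>                           <!-- explicit length 1 -->
    <LongitudeDegrees>nod</LongitudeDegrees>     <!-- explicit length 3 -->
    <LongitudeMinutes>ata</LongitudeMinutes>     <!-- lazy match up to the first E or W -->
    <LongitudeHemisphere>W</LongitudeHemisphere> <!-- remainder of the field -->
</Location>
```

Almost every part is garbage (even Hyphen comes out as "T"), yet nothing at
parse time objects, because the only checks on content are the facets.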

To fix that, I think you want to add assertions with dfdl:checkConstraints
so your pattern facets get checked and affect the parse.

Most convenient way is just with a common type def:

<xs:simpleType name="validString">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
        </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
</xs:simpleType>

Then use validString instead of plain xs:string as the restriction base
everywhere. All your pattern facets will then be checked as part of this
assert, which will be on every string.
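
As a sketch of what that substitution might look like for one part, here is
LatitudeDegrees from the quoted schema with only the restriction base
changed. It assumes validString is defined in the same schema with no
target namespace; if the schema has one, the base would need the matching
prefix:

```xml
<xs:element name="LatitudeDegrees"
            dfdl:lengthKind="explicit"
            dfdl:length="2">
    <xs:simpleType>
        <!-- base changed from xs:string so the assert on validString applies -->
        <xs:restriction base="validString">
            <xs:pattern value="[0-9]{2}"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>
```

With this in place, a value such as "no" should cause a parse error rather
than only a validation error, since the assert evaluates the pattern facet
during parsing.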

Now here's what's a bit interesting: normally I don't recommend
checkConstraints(.) everywhere in schemas, but I think what we've learned
is that this advice is for typical schemas, where numbers are converted
from text to numeric types and date/time fields are converted to the
date/time types. Those type conversions enforce a lot of syntax rules on
the data. If the data survives the conversion from text to type, it is
well formed. So you don't need checkConstraints() to be sure the data is
well formed.

To get the same thing in your all-strings approach we really need to force
the facet patterns to be checked during parsing per the above
validString/checkConstraints() trick.

So my past advice not to use checkConstraints(.) everywhere really does
depend on the facet patterns - are they validation, or are they about
well-formedness of strings? If the latter, then you really do need to call
checkConstraints() at parse time for those strings.

Minor added point: Your degrees and minutes regexes allow latitudes of
91-99, longitudes of 181-999, and minutes (integer part) of 60-99. That
might be ok if that's the de-facto data you need to handle, but you may
also want to be tighter about that.
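
If you do want to tighten things, one possible set of patterns is sketched
below. The type names (latDegrees, lonDegrees, minutes) are hypothetical,
made up for this illustration. XSD patterns are implicitly anchored, so no
^ or $ is needed:

```xml
<!-- Latitude degrees: 00 through 90 -->
<xs:simpleType name="latDegrees">
    <xs:restriction base="xs:string">
        <xs:pattern value="[0-8][0-9]|90"/>
    </xs:restriction>
</xs:simpleType>

<!-- Longitude degrees: 000 through 180 -->
<xs:simpleType name="lonDegrees">
    <xs:restriction base="xs:string">
        <xs:pattern value="0[0-9]{2}|1[0-7][0-9]|180"/>
    </xs:restriction>
</xs:simpleType>

<!-- Minutes: 00 through 59, optionally followed by 1-4 decimal digits -->
<xs:simpleType name="minutes">
    <xs:restriction base="xs:string">
        <xs:pattern value="[0-5][0-9](\.[0-9]{1,4})?"/>
    </xs:restriction>
</xs:simpleType>
```

Because each part is checked on its own, a latitude of 90 degrees with
non-zero minutes (or a longitude of 180 with non-zero minutes) would still
get through; catching that would need a check across parts.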






On Mon, Sep 19, 2022 at 12:52 PM Roger L Costello <coste...@mitre.org>
wrote:

> Hi Folks,
>
> I am jumping around in my writeups.
>
> As always, please let me know of anything that is unclear.  /Roger
>
>
> --------------------------------------------------------------------------------------
> 11. Variable length, nillable, composite, no choice
>
>
>
> A composite field is one that is composed of parts. There is no separator
> between the parts. The parts may be fixed length or variable length. The
> parts are non-nillable, although the composite field itself may be
> nillable.
>
> This section deals with composite fields containing parts that are
> variable length and the field is nillable.
>
> We will create a DFDL schema for a “Location” field that has a latitude
> and longitude, separated by a dash. Here is a sample value:
>
> 2006N-05912E
>
> That is one value with 7 parts:
>
> The first two digits (20) represent the latitude in degrees.
>
> The next two digits (06) represent the latitude in minutes.
>
> The N indicates the latitude’s hemisphere.
>
> The dash ( - ) separates the latitude values from the following longitude
> values.
>
> The 059 represents the longitude in degrees.
>
> The 12 represents the longitude in minutes.
>
> The E represents the longitude hemisphere.
>
> In other words, the location is latitude 20 degrees, 6 minutes North,
> longitude 59 degrees, 12 minutes East.
>
> Both the latitude minutes and the longitude minutes are variable length:
> each is expressed as a two-digit integer or as a decimal value. If a
> decimal, there may be 1-4 digits to the right of the decimal point. Here
> are Location values with minute parts (highlighted in yellow) that have
> decimal values:
>
> 4221.6N-71003.5W
> 4221.63N-71003.57W
> 4221.630N-71003.576W
> 4221.6300N-71003.5760W
>
> Here is one more example of a valid Location value:
>
> -
>
> That value means: no data was available to populate the field.
>
> To re-emphasize, Location is a variable length, nillable, composite field.
>
> Here is an XML Schema declaration of Location, sans any DFDL properties (I
> highlighted in yellow the field name and part names):
>
> <xs:element name="Location" nillable="true">
>     <xs:complexType>
>         <xs:sequence>
>             <xs:element name="LatitudeDegrees">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{2}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LatitudeMinutes">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{1}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{3}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{4}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LatitudeHemisphere">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:enumeration value="N" />
>                         <xs:enumeration value="S" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>            <xs:element name="Hyphen">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:enumeration value="-" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LongitudeDegrees">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{3}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LongitudeMinutes">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{1}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{3}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{4}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LongitudeHemisphere">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:enumeration value="E" />
>                         <xs:enumeration value="W" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
> These parts have fixed length: LatitudeDegrees, LatitudeHemisphere,
> Hyphen, LongitudeDegrees, and LongitudeHemisphere.
>
> These parts have variable length: LatitudeMinutes and LongitudeMinutes.
>
> For the fixed length parts, add these two DFDL properties:
>
> dfdl:lengthKind="explicit"
> dfdl:length="__"
>
> For example, LatitudeDegrees has a fixed length of 2. Here is its
> declaration, with the DFDL properties (in yellow) added:
>
> <xs:element name="LatitudeDegrees"
>                       dfdl:lengthKind="explicit"
>                       dfdl:length="2">
>     <xs:simpleType>
>         <xs:restriction base="xs:string">
>             <xs:pattern value="[0-9]{2}" />
>         </xs:restriction>
>     </xs:simpleType>
> </xs:element>
>
> Use the same strategy for the other fixed fields.
>
> LatitudeMinutes is variable length. The part that follows it
> (LatitudeHemisphere) has a fixed length (its value is either N or S). To
> declare LatitudeMinutes, add these two DFDL properties:
>
> dfdl:lengthKind="pattern"
> dfdl:lengthPattern="*regex*"
>
> In the regex use a lookahead pattern. Here is LatitudeMinutes, extended
> with the DFDL properties (in yellow):
>
> <xs:element name="LatitudeMinutes"
>                        dfdl:lengthKind="pattern"
>                        dfdl:lengthPattern=".*?(?=(N|S))">
>     <xs:simpleType>
>         <xs:restriction base="xs:string">
>             <xs:pattern value="[0-9]{2}"/>
>             <xs:pattern value="[0-9]{2}\.[0-9]{1}"/>
>             <xs:pattern value="[0-9]{2}\.[0-9]{2}"/>
>             <xs:pattern value="[0-9]{2}\.[0-9]{3}"/>
>             <xs:pattern value="[0-9]{2}\.[0-9]{4}"/>
>         </xs:restriction>
>     </xs:simpleType>
> </xs:element>
>
> Read that as: the content of LatitudeMinutes is the text up to, but not
> including N or S.
>
> Use the same regex lookahead strategy for LongitudeMinutes.
>
> As I stated earlier, Location is nillable with hyphen as the nil value.
> Further, Location has a complexType. That is a problem. See section 2 for a
> complete discussion of the problem with nillable complexTypes and how to
> deal with it.
>
> Here’s the DFDL schema for the Location field (DFDL is shown in yellow):
>
> <xs:element name="Location">
>     <xs:complexType>
>         <xs:sequence>
>             <xs:element name="LatitudeDegrees"
>                                    dfdl:lengthKind="explicit"
>                                    dfdl:length="2">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{2}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LatitudeMinutes"
>                                    dfdl:lengthKind="pattern"
>                                    dfdl:lengthPattern=".*?(?=(N|S))">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{1}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{3}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{4}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LatitudeHemisphere"
>                                    dfdl:lengthKind="explicit"
>                                    dfdl:length="1">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:enumeration value="N" />
>                         <xs:enumeration value="S" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="Hyphen"
>                                   dfdl:lengthKind="explicit"
>                                   dfdl:length="1">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:enumeration value="-" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LongitudeDegrees"
>                                    dfdl:lengthKind="explicit"
>                                    dfdl:length="3">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{3}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LongitudeMinutes"
>                                    dfdl:lengthKind="pattern"
>                                    dfdl:lengthPattern=".*?(?=(E|W))">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:pattern value="[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{1}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{2}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{3}" />
>                         <xs:pattern value="[0-9]{2}\.[0-9]{4}" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>             <xs:element name="LongitudeHemisphere">
>                 <xs:simpleType>
>                     <xs:restriction base="xs:string">
>                         <xs:enumeration value="E" />
>                         <xs:enumeration value="W" />
>                     </xs:restriction>
>                 </xs:simpleType>
>             </xs:element>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
> Notice that the last part (LongitudeHemisphere) has no DFDL added. This is
> because I am assuming that it is followed by the delimiter for the Location
> field.
>
