Design Principles
I follow these design principles:

  1.  Say nothing when you have nothing to say.
  2.  Don't stop parsing when invalid input is encountered but do generate 
warning messages.
  3.  Use the strongest datatype possible.
  4.  Display data, not syntax.

Say nothing when you have nothing to say
Oftentimes a special symbol is inserted into a data field to indicate that no 
data is available to populate the field. For example, a document might contain 
a person's first name, middle name, and last name. If no data is available for 
the middle name, the field is populated with N/A. Here is a sample document:
    John, N/A, Smith
John Smith has no middle name.
One approach to modeling a field with no data is to treat the special symbol as 
a nil value: output an element for the field anyway, and mark it with 
xsi:nil="true" to indicate that no data was available, e.g.,
    <MiddleName xsi:nil="true"/>
An alternate approach is to output no element at all when no data is available, 
i.e., omit the <MiddleName> element. This is achieved using this DFDL idiom:
    <xs:choice>
        <xs:sequence dfdl:initiator="N/A"/>
        <xs:element name="MiddleName" type="xs:string"/>
    </xs:choice>

If the data field contains N/A then do not output an XML element. Otherwise, 
output a <MiddleName> element, populated with the person's middle name. This is 
the approach I take.
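
For context, here is a minimal sketch of how the idiom might sit inside a record for the sample document. The element names Person, FirstName, and LastName are hypothetical. The separator is assumed to be a comma followed by one space (written with the DFDL entity %SP;), and the remaining properties, such as dfdl:lengthKind="delimited", are assumed to be supplied by a dfdl:format annotation elsewhere in the schema:

```xml
<!-- Sketch only: Person, FirstName, LastName are illustrative names. -->
<xs:element name="Person">
    <xs:complexType>
        <!-- Assumed separator: a comma followed by a single space -->
        <xs:sequence dfdl:separator=",%SP;" dfdl:separatorPosition="infix">
            <xs:element name="FirstName" type="xs:string"/>
            <xs:choice>
                <!-- Branch 1: consume the N/A marker, output nothing -->
                <xs:sequence dfdl:initiator="N/A"/>
                <!-- Branch 2: a real middle name is present -->
                <xs:element name="MiddleName" type="xs:string"/>
            </xs:choice>
            <xs:element name="LastName" type="xs:string"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>
```

Under those assumptions, the sample document should produce only FirstName and LastName elements, with no trace of the N/A marker in the output.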

Don't stop parsing when invalid input is encountered but do generate warning messages
Sometimes a data field has a restricted set of values. For example, a field 
representing the color of a banana might be restricted to the values yellow, 
green, and brown. If the field contains another color, say blue, that value is 
well-formed but not valid. One approach to dealing with well-formed but invalid 
data is to halt parsing as soon as invalid data is detected. That's the 
behavior you get when you use a dfdl:assert annotation whose test calls the 
dfdl:checkConstraints function.
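
As a rough sketch of that halting approach, the banana color field might be declared as follows. The element name BananaColor and the message text are hypothetical, and properties such as dfdl:lengthKind are assumed to come from a dfdl:format annotation elsewhere in the schema:

```xml
<!-- Sketch only: BananaColor is an illustrative name. -->
<xs:element name="BananaColor">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <!-- Fails the parse if the value violates the facets below -->
            <dfdl:assert test="{ dfdl:checkConstraints(.) }"
                         message="Banana color is not yellow, green, or brown"/>
        </xs:appinfo>
    </xs:annotation>
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:enumeration value="yellow"/>
            <xs:enumeration value="green"/>
            <xs:enumeration value="brown"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>
```

With a declaration like this, the input blue would be expected to cause a processing error at this element rather than appear in the output.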
An alternate approach is to design the DFDL schema to specify the legal values 
using xs:enumeration facets and then run Daffodil with the option --validate 
limited. That option instructs Daffodil to generate a warning when it 
encounters data that doesn't conform to the facets, and then to continue parsing.
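
A minimal sketch of the facet-only declaration, with no assertion attached, might look like this. The element name BananaColor is hypothetical, and properties such as dfdl:lengthKind are assumed to be defined elsewhere in the schema:

```xml
<!-- Sketch only: BananaColor is an illustrative name. -->
<xs:element name="BananaColor">
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <!-- Legal values; checked by validation, not by the parser -->
            <xs:enumeration value="yellow"/>
            <xs:enumeration value="green"/>
            <xs:enumeration value="brown"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>
```

Because nothing in this declaration asserts on the facets, the parser itself accepts any string here. Whether blue is flagged depends entirely on the validation mode Daffodil is run with.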

Use the strongest datatype possible
For text data formats it is tempting to model everything as a string. It's 
simple. You can precisely specify the form of the data using a regular 
expression (regex). For example, a field containing a latitude degrees value 
could be modeled as a string along with a regex to constrain the string to two 
integer digits:
    <xs:element name="LatitudeDegrees"
                dfdl:lengthKind="explicit"
                dfdl:length="2">
        <xs:simpleType>
            <xs:restriction base="validString">
                <xs:pattern value="[0-9]{2}"/>
            </xs:restriction>
        </xs:simpleType>
    </xs:element>

Then this input - 05 - will yield this output 
<LatitudeDegrees>05</LatitudeDegrees>. Notice the 0 in the output. For numeric 
values we would expect leading zeros to be removed. However, since the data 
field has been modeled as a string, there is no concept of normalizing numeric 
values. For that, you need to use a numeric datatype.
A better approach is to use the strongest datatype possible. For this example, 
use an xs:unsignedInt datatype:
    <xs:element name="LatitudeDegrees"
                type="xs:unsignedInt"
                dfdl:lengthKind="explicit"
                dfdl:length="2"
                dfdl:textNumberRep="standard"
                dfdl:textNumberCheckPolicy="strict"
                dfdl:textNumberPattern="#"
                dfdl:textStandardGroupingSeparator=","
                dfdl:textStandardDecimalSeparator="."
                dfdl:textStandardBase="10"
                dfdl:textNumberRounding="pattern"/>

This input - 05 - yields this output <LatitudeDegrees>5</LatitudeDegrees>. 
Notice the leading 0 has been removed.

Display data, not syntax
Consider a field that contains two runway designators with a hyphen connecting 
them, e.g.,
24L-36R
"Runway 24 left, 36 right."
The field is a composite of a runway designator, hyphen, runway designator. A 
literal modeling of that field yields this output:
<Runway_Designator_1>24L</Runway_Designator_1>
<Hyphen>-</Hyphen>
<Runway_Designator_2>36R</Runway_Designator_2>

However, the hyphen provides no meaningful information. Design the DFDL schema 
to output only meaningful information and hide everything else:
<Runway_Designator_1>24L</Runway_Designator_1>
<Runway_Designator_2>36R</Runway_Designator_2>
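
One way to get that output, sketched here under some assumptions, is to declare the hyphen as a separator instead of modeling it as an element. Delimiters are consumed during parsing but never appear in the output. The wrapper element name Runways is hypothetical, and the child elements are assumed to inherit dfdl:lengthKind="delimited" and the other required properties from a dfdl:format annotation elsewhere in the schema:

```xml
<!-- Sketch only: Runways is an illustrative wrapper name. -->
<xs:element name="Runways">
    <xs:complexType>
        <!-- The hyphen is syntax, so it is a separator, not an element -->
        <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
            <xs:element name="Runway_Designator_1" type="xs:string"/>
            <xs:element name="Runway_Designator_2" type="xs:string"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>
```

The same idea applies to other purely syntactic markers such as initiators and terminators: describing them with DFDL properties keeps them out of the output.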

I welcome your comments!
