Design Principles

I follow these design principles:

1. Say nothing when you have nothing to say.
2. Don't stop parsing when invalid input is encountered, but do generate warning messages.
3. Use the strongest datatype possible.
4. Display data, not syntax.

Say nothing when you have nothing to say

Oftentimes a special symbol is inserted into a data field to indicate that no data is available to populate the field. For example, a document might contain a person's first name, middle name, and last name. If no data is available for the middle name, the field is populated with N/A. Here is a sample document:

    John, N/A, Smith

John Smith has no middle name.

One approach to modeling a field with no data is to treat the special symbol as a nil value: output an element for the field and use xsi:nil="true" to indicate that no data was available, e.g.,

    <MiddleName xsi:nil="true"/>

An alternate approach is to output no element when there is no data available, i.e., do not output a <MiddleName> element. This is achieved using this DFDL idiom:

    <xs:choice>
        <xs:sequence dfdl:initiator="N/A"/>
        <xs:element name="MiddleName" type="xs:string"/>
    </xs:choice>

If the data field contains N/A, then no XML element is output. Otherwise, a <MiddleName> element is output, populated with the person's middle name. This is the approach I take.

Don't stop parsing when invalid input is encountered but do generate warning messages

Sometimes a data field has a restricted set of values. For example, a field representing the color of a banana might be restricted to the values yellow, green, and brown. If the field contains another color, say blue, that value is well-formed but not valid.

One approach to dealing with well-formed but invalid data is to halt parsing as soon as invalid data is detected. That's the behavior you get when you use a dfdl:assert that calls the dfdl:checkConstraints function.

An alternate approach is to design the DFDL schema to specify the legal values using xs:enumeration facets and then run Daffodil with the option --validate limited. That option instructs Daffodil to generate a warning when it encounters data that doesn't conform to the facets, but to continue parsing.

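As a minimal sketch of the enumeration approach for the banana example: the element name BananaColor is hypothetical, the xs: prefix is assumed to be bound to the XML Schema namespace, and the DFDL representation properties (length, encoding, and so on) are assumed to be defined elsewhere in the schema.

```xml
<!-- Hypothetical element: legal values are expressed as facets, not as a dfdl:assert. -->
<!-- Representation properties (lengthKind, encoding, ...) are assumed to be set elsewhere. -->
<xs:element name="BananaColor">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="yellow"/>
      <xs:enumeration value="green"/>
      <xs:enumeration value="brown"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

Because the facets are checked by validation rather than by an assertion, an input of blue should still be parsed into a <BananaColor> element, with the facet violation reported separately.
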
Use the strongest datatype possible

For text data formats it is tempting to model everything as a string. It's simple, and you can precisely specify the form of the data using a regular expression (regex). For example, a field containing a latitude degrees value could be modeled as a string along with a regex that constrains the string to two digits:

    <xs:element name="LatitudeDegrees" dfdl:lengthKind="explicit" dfdl:length="2">
        <xs:simpleType>
            <xs:restriction base="validString">
                <xs:pattern value="[0-9]{2}"/>
            </xs:restriction>
        </xs:simpleType>
    </xs:element>

Then the input 05 yields this output:

    <LatitudeDegrees>05</LatitudeDegrees>

Notice the 0 in the output. For numeric values we would expect leading zeros to be removed. However, since the data field has been modeled as a string, there is no concept of normalizing numeric values. For that, you need to use a numeric datatype.

A better approach is to use the strongest datatype possible. For this example, use the xs:unsignedInt datatype:

    <xs:element name="LatitudeDegrees" type="xs:unsignedInt"
        dfdl:lengthKind="explicit"
        dfdl:length="2"
        dfdl:textNumberRep="standard"
        dfdl:textNumberCheckPolicy="strict"
        dfdl:textNumberPattern="#"
        dfdl:textStandardGroupingSeparator=","
        dfdl:textStandardDecimalSeparator="."
        dfdl:textStandardBase="10"
        dfdl:textNumberRounding="pattern">
    </xs:element>

The input 05 now yields this output:

    <LatitudeDegrees>5</LatitudeDegrees>

Notice the leading 0 has been removed.

Display data, not syntax

Consider a field that contains two runway designators with a hyphen connecting them, e.g., 24L-36R ("Runway 24 left, 36 right"). The field is a composite of a runway designator, a hyphen, and another runway designator. A literal modeling of that field yields this output:

    <Runway_Designator_1>24L</Runway_Designator_1>
    <Hyphen>-</Hyphen>
    <Runway_Designator_2>36R</Runway_Designator_2>

However, the hyphen provides no meaningful information.
Design the DFDL schema to output only meaningful information and hide everything else:

    <Runway_Designator_1>24L</Runway_Designator_1>
    <Runway_Designator_2>36R</Runway_Designator_2>

I welcome your comments!