Minimalist DFDL

Roger L Costello Mon, 02 Aug 2021 10:48:09 -0700

Hi Folks,

The learning curve for DFDL is long and steep but I have found a tiny subset of 
DFDL that can be learned in less than a day and has (I believe) all the power 
of Full DFDL. I call the subset Minimalist DFDL. If you have experience with 
other parser generators such as lex/yacc, flex/bison, or ANTLR, then you will 
find their ideas directly apply to Minimalist DFDL. In other words, the 
learning curve drops way down.


The following discussion applies only to text data formats. I haven't thought 
about a Minimalist DFDL for binary data formats.

The first key point in Minimalist DFDL is that there is only one datatype: 
string. There are no integers, dates, Booleans, decimals, etc. What that means 
is we can ignore all their properties.

The second key point is that every data item can be specified with a regular 
expression (regex).

Let's jump right in and look at an example. The example illustrates (nearly) 
all the DFDL properties you need to know.
<xs:element name="Runway" dfdl:terminator="--">
    <xs:complexType>
        <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
            <xs:element name="RunwayWidth" type="xs:string"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])">
                <xs:annotation>
                    <xs:appinfo source="http://www.ogf.org/dfdl/">
                        <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
                    </xs:appinfo>
                </xs:annotation>
            </xs:element>
            <xs:element name="RunwayComposition" type="xs:string"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*">
                <xs:annotation>
                    <xs:appinfo source="http://www.ogf.org/dfdl/">
                        <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
                    </xs:appinfo>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

In my example, input documents contain data about a runway: the width of the 
runway followed by its composition. The width is a value from 0 to 999, 
right-justified in a 4-character field. The range 0-999 is specified by this 
regex:
[0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9]
Since we want the width value right justified (i.e., spaces precede the value), 
we need to prepend this to the regex:
[ ]*
yielding this regex:
[ ]*([0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9])
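That regex does not work in Daffodil as written. I believe this is a bug in 
Daffodil, although it may instead be the regex engine's ordered alternation 
(alternatives are tried left to right and the first one that matches wins, so 
[0-9] would consume only the first digit of a longer value). Either way, I 
found that rotating the parts of the regex that describe 0-999, so that the 
part describing the highest value (99[0-9]) comes first and the part describing 
the smallest value ([0-9]) comes last, makes Daffodil work fine. So the regex 
is this: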
[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])
An oddity of DFDL is that a length pattern that fails to match yields a 
zero-length string instead of a parse error. We don't want that. To prevent it, 
add a dfdl:assert containing an XPath expression which says the string length 
(remember, everything is a string) must be greater than 0:
<xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) gt 0 }"/>
    </xs:appinfo>
</xs:annotation>
However, we can do better than that. We know the length must be 4, so let's 
have the XPath expression state that the string length is 4:
<xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
    </xs:appinfo>
</xs:annotation>
Next, runway composition. It takes one of an enumerated list of values and is 
left-justified in an 8-character field. The regex is trivial (be sure to put 
[ ]* at the end so that trailing spaces pad the left-justified value):
(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*
Again, we add dfdl:assert to prevent zero-length strings from matching:
<xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
    </xs:appinfo>
</xs:annotation>
The data for runway width and runway composition are separated by a slash. If 
we think of input data as a sentence, then slash is punctuation. We need a way 
to express punctuation, and DFDL does a good job with that via the separator, 
initiator, and terminator properties.
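
The Runway example uses a separator and a terminator but no initiator. As a 
rough sketch of all three together, suppose each record also began with a fixed 
label. The label RWY: is a hypothetical value chosen only for illustration (it 
is not part of the example format), and default format properties are assumed 
to be declared elsewhere, as in the example. Data such as RWY: 150/ASPHALT -- 
is what this variant is meant to describe:

```xml
<!-- Hypothetical variant of the Runway element: the initiator "RWY:" is an
     assumption made for illustration; everything else matches the example. -->
<xs:element name="Runway" dfdl:initiator="RWY:" dfdl:terminator="--">
    <xs:complexType>
        <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
            <!-- 4-character, right-justified width (0-999) -->
            <xs:element name="RunwayWidth" type="xs:string"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])">
                <xs:annotation>
                    <xs:appinfo source="http://www.ogf.org/dfdl/">
                        <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
                    </xs:appinfo>
                </xs:annotation>
            </xs:element>
            <!-- 8-character, left-justified composition -->
            <xs:element name="RunwayComposition" type="xs:string"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*">
                <xs:annotation>
                    <xs:appinfo source="http://www.ogf.org/dfdl/">
                        <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
                    </xs:appinfo>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>
```

The punctuation stays in the three properties and the field content stays in 
the regexes, so neither has to know about the other.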

"What about nil values? Don't we need the DFDL properties associated with 
nillable?" No, we don't. Nil values can be easily expressed in the regex. For 
example, suppose that when there is no runway width data, the field must 
contain a hyphen (padded with spaces to fill the 4-character field). That is 
easily incorporated into the regex using a regex choice:
([ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9]))| [ ]*\-[ ]*
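
Placed in the schema, the nil-aware regex might look like the sketch below. The 
pattern is copied verbatim from the line above; keeping the same length-4 
assert for both alternatives is my assumption, based on the hyphen being padded 
to fill the same 4-character field:

```xml
<!-- Sketch only: RunwayWidth accepting either a right-justified 0-999 value
     or a space-padded hyphen meaning "no data". -->
<xs:element name="RunwayWidth" type="xs:string"
    dfdl:lengthKind="pattern"
    dfdl:lengthPattern="([ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9]))| [ ]*\-[ ]*">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <!-- Assumed: both forms must fill the whole 4-character field -->
            <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
        </xs:appinfo>
    </xs:annotation>
</xs:element>
```

Since everything is a string, the hyphen arrives in the parsed result as 
ordinary field content; deciding that it means "no data" is left to whatever 
consumes that result.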

"What about escaping values? Don't we need the DFDL properties associated with 
escapes?" No, we don't. Escaping is something that was resolved long ago with 
regexes.

"Isn't it purer to treat numbers as numbers, dates as dates, etc. rather than 
treating everything as strings?" Purer? What does that mean? It is a 
meaningless term. Does Minimalist DFDL get the job done (parsing and 
unparsing)? If so, that's all that matters. I'll take simplicity over purity 
any day.

"Does Minimalist DFDL work with both parsing and unparsing?" Yes. Beautifully.

"What about hidden groups? Your example doesn't have any; are you saying that 
hidden groups aren't needed?" No, hidden groups are useful. Other things not 
shown but needed include dfdl:occursCountKind="implicit", 
dfdl:choiceLengthKind, and dfdl:choiceLength.

"What about the DFDL transformation properties such as dfdl:inputValueCalc? 
Aren't they needed?" No. The Minimalist DFDL philosophy is that DFDL is a 
parsing language, not a transformation language. If you need to do 
transformations, do them after parsing (using something other than DFDL).

"Aren't regexes hard to read, write, and maintain?" Well, they are, but I'll 
make four points: (1) their complexity can be managed through various naming 
mechanisms (e.g., use the XML ENTITY mechanism to create named regexes), (2) 
regexes have been around a long time, are well understood, have lots of 
excellent regex processors, and are widely used throughout the programming 
community (i.e., there exists a large pool of people who understand regexes), 
(3) regexes provide razor-sharp precision (no fuzziness/ambiguity), and (4) 
despite their complexity they are a whole lot easier than having to deal with a 
ton of DFDL properties.
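
To illustrate point (1), here is a minimal sketch of named regexes built with 
the XML ENTITY mechanism. The entity names pad, width, and comp are 
hypothetical, the dfdl:format block with default properties is left out, and 
whether a given DFDL processor accepts a schema containing a DOCTYPE 
declaration is an assumption to verify (some XML tooling rejects DOCTYPE 
declarations for security reasons):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Internal DTD subset: each ENTITY gives a regex fragment a name.
     The names pad, width, and comp are illustrative only. -->
<!DOCTYPE xs:schema [
    <!ENTITY pad   "[ ]*">
    <!ENTITY width "(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])">
    <!ENTITY comp  "(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)">
]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
           xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <!-- A dfdl:format block with default properties is omitted in this sketch. -->
    <xs:element name="Runway" dfdl:terminator="--">
        <xs:complexType>
            <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
                <!-- Expands to [ ]*(99[0-9]|...|[0-9]) when the XML is parsed -->
                <xs:element name="RunwayWidth" type="xs:string"
                    dfdl:lengthKind="pattern"
                    dfdl:lengthPattern="&pad;&width;">
                    <xs:annotation>
                        <xs:appinfo source="http://www.ogf.org/dfdl/">
                            <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
                        </xs:appinfo>
                    </xs:annotation>
                </xs:element>
                <!-- Expands to (ASPHALT|...|GRAVEL)[ ]* -->
                <xs:element name="RunwayComposition" type="xs:string"
                    dfdl:lengthKind="pattern"
                    dfdl:lengthPattern="&comp;&pad;">
                    <xs:annotation>
                        <xs:appinfo source="http://www.ogf.org/dfdl/">
                            <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
                        </xs:appinfo>
                    </xs:annotation>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

The XML parser expands the entity references before the DFDL processor sees 
the attribute values, so the schema behaves as if the regexes had been written 
out in full.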

I welcome your comments. Are there text data formats that can be specified 
using Full DFDL that cannot be specified using Minimalist DFDL? Concrete 
examples would be appreciated.

/Roger
