Hi Folks,

The learning curve for DFDL is long and steep, but I have found a tiny subset of DFDL that can be learned in less than a day and has (I believe) all the power of Full DFDL. I call the subset Minimalist DFDL. If you have experience with other parser generators such as lex/yacc, flex/bison, or ANTLR, then you will find their ideas directly apply to Minimalist DFDL. In other words, the learning curve drops way down.

The following discussion applies only to text data formats. I haven't thought about a Minimalist DFDL for binary data formats.

The first key point in Minimalist DFDL is that there is only one datatype: string. There are no integers, dates, Booleans, decimals, etc. What that means is we can ignore all their properties.

The second key point is that every data item can be specified with a regular expression (regex).

Let's jump right in and look at an example. The example illustrates (nearly) all the DFDL properties you need to know.

    <xs:element name="Runway" dfdl:terminator="--">
      <xs:complexType>
        <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
          <xs:element name="RunwayWidth" type="xs:string"
                      dfdl:lengthKind="pattern"
                      dfdl:lengthPattern="[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])">
            <xs:annotation>
              <xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
              </xs:appinfo>
            </xs:annotation>
          </xs:element>
          <xs:element name="RunwayComposition" type="xs:string"
                      dfdl:lengthKind="pattern"
                      dfdl:lengthPattern="(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*">
            <xs:annotation>
              <xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
              </xs:appinfo>
            </xs:annotation>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

In my example, input documents contain data about a runway: the width of a runway followed by the composition of the runway.

The value of width is 0-999 and is right-justified in a 4-character field. 0-999 is specified by this regex:

    [0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9]

Since we want the width value right-justified (i.e., spaces precede the value), we need to prepend this to the regex:

    [ ]*

yielding this regex:

    [ ]*([0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9])

There is a bug in Daffodil which prevents that regex from working.
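
One possible explanation, offered here as an assumption rather than a confirmed diagnosis of Daffodil, is that many regex engines try the alternatives in a group from left to right and stop at the first one that succeeds, rather than picking the longest match. The sketch below uses Python's `re` module (not Daffodil's own regex engine) to show that behaviour with the two orderings of the width regex; the helper name `matched_prefix` and the sample input are made up for illustration.

```python
import re

# Alternatives ordered smallest-first, as in the regex shown above.
SMALLEST_FIRST = re.compile(
    r"[ ]*([0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9])"
)

# The same alternatives ordered highest-first.
HIGHEST_FIRST = re.compile(
    r"[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])"
)


def matched_prefix(pattern, data):
    """Return the prefix of `data` consumed by `pattern`, or None if no match."""
    m = pattern.match(data)  # match() anchors at the start of the data only
    return m.group(0) if m else None


# Hypothetical input: width, slash separator, composition, terminator.
sample = " 150/ASPHALT --"

print(repr(matched_prefix(SMALLEST_FIRST, sample)))  # ' 1'   (2 characters)
print(repr(matched_prefix(HIGHEST_FIRST, sample)))   # ' 150' (4 characters)
```

If Daffodil's engine behaves the same way, the smallest-first ordering would consume only 2 characters of the width field, so an assertion that the string length is 4 would fail.
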
However, I found that by reordering the alternatives that describe 0-999, with the alternative for the highest values (99[0-9]) first and the alternative for the smallest values ([0-9]) last, Daffodil works fine. So the regex is this:

    [ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])

An oddity of DFDL is that zero-length strings match regexes. We don't want that. To prevent it, add a dfdl:assert containing an XPath expression which says the string length (remember, everything is a string) must be greater than 0:

    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) gt 0 }"/>
      </xs:appinfo>
    </xs:annotation>

However, we can do better than that. We know the length must be 4, so let's have the XPath expression state that the string length is 4:

    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
      </xs:appinfo>
    </xs:annotation>

Next, runway composition. It has an enumeration list of values and is left-justified in an 8-character field. The regex is trivial (be sure to put [ ]* at the end so that the enumeration value is left-justified):

    (ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*

Again, we add dfdl:assert to prevent zero-length strings from matching:

    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
      </xs:appinfo>
    </xs:annotation>

The data for runway width and runway composition are separated by a slash. If we think of input data as a sentence, then slash is punctuation. We need a way to express punctuation, and DFDL does a good job with that via the separator, initiator, and terminator properties.

"What about nil values? Don't we need the DFDL properties associated with nillable?"

No, we don't. Nil values can be easily expressed in the regex. For example, suppose that when there is no runway width data then the field must contain a hyphen (with spaces to create a field with 4 characters).
That is easily incorporated into the regex using a regex choice:

    ([ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9]))|[ ]*\-[ ]*

"What about escaping values? Don't we need the DFDL properties associated with escapes?"

No, we don't. Escaping is something that was resolved long ago with regexes.

"Isn't it purer to treat numbers as numbers, dates as dates, etc. rather than treating everything as strings?"

Purer? What does that mean? It is a meaningless term. Does Minimalist DFDL get the job done (parsing and unparsing)? If so, that's all that matters. I'll take simplicity over purity any day.

"Does Minimalist DFDL work with both parsing and unparsing?"

Yes. Beautifully.

"What about hidden groups? Your example doesn't have them; are you saying that hidden groups aren't needed?"

No, hidden groups are useful. Other things not shown but needed include dfdl:occursCountKind="implicit", dfdl:choiceLengthKind, and dfdl:choiceLength.

"What about the DFDL transformation properties such as inputValueCalc, aren't they needed?"

No. The Minimalist DFDL philosophy is that it is a parsing language, not a transformation language. If you need to do transformations, do them after parsing (using something other than DFDL).

"Aren't regexes hard to read, write, and maintain?"

Well, they are, but I'll make four points: (1) their complexity can be managed through various naming mechanisms (e.g., use the XML ENTITY mechanism to create named regexes), (2) regexes have been around a long time, are well understood, have lots of excellent regex processors, and are widely used throughout the programming community (i.e., there exists a large pool of people who understand regexes), (3) regexes provide razor-sharp precision (no fuzziness/ambiguity), and (4) despite their complexity they are a whole lot easier than having to deal with a ton of DFDL properties.

I welcome your comments. Are there text data formats that can be specified using Full DFDL that cannot be specified using Minimalist DFDL?
Concrete examples would be appreciated.

/Roger