Re: Minimalist DFDL

Beckerle, Mike Mon, 02 Aug 2021 17:04:22 -0700

It is interesting to try to do this sort of minimalist thing. I'm not sure 
that making DFDL easier for sophisticated programmers who have used 
lex/yacc/bison/antlr is the right goal, however.


I want DFDL to be easier for people who couldn't possibly understand or use 
those things, which are, btw, far more powerful than DFDL. DFDL is more 
limited entirely on purpose.

I do understand why you would introduce DFDL with only strings as the sole data 
type, because then you can get through the concepts of how the language hangs 
together without getting bogged down in all the properties that are type 
specific.

Those could then be studied separately when needed.

The basic concepts of delimited and fixed-length text are relatively easy to 
understand and use.
But the concept of "length" is very central to data format, and that implies 
the concept of integers.

I would suggest:

1) add type xs:integer also - only text, base 10, standard.
2) drop dfdl:lengthKind='pattern' and regex.
3) add dfdl:outputValueCalc

My rationale follows.

You said in your description of the text of the runway width, and I quote: "The 
value of width is 0-999 and is right-justified in a 4-character field". So you 
are talking right there about integers, and about text justification in a 
fixed-length environment.

Why make the user disguise all that in regular-expression complexity?

Why not use exactly the concepts and terms you used in your sentence: integers 
with numeric range, and vocabulary like "right justified"? Here are the 
properties and facets needed:

type="xs:integer"
dfdl:lengthKind="explicit"
dfdl:length="4"
dfdl:textNumberJustification="right"
xs:minInclusive value="0"
xs:maxInclusive value="999"

These seem pretty well motivated.
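
As a rough illustration of what those properties and facets express, here is a 
minimal Python sketch. The function names are hypothetical and this is not how 
Daffodil implements anything; it only mirrors the fixed length, the right 
justification, and the range check.

```python
def parse_runway_width(field: str) -> int:
    """Parse a 4-character, right-justified, base-10 integer in 0..999."""
    if len(field) != 4:                  # dfdl:lengthKind="explicit", dfdl:length="4"
        raise ValueError("field must be exactly 4 characters")
    digits = field.lstrip(" ")           # right-justified: padding is on the left
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a base-10 integer: {field!r}")
    value = int(digits)
    if not 0 <= value <= 999:            # xs:minInclusive / xs:maxInclusive
        raise ValueError(f"out of range 0..999: {value}")
    return value


def unparse_runway_width(value: int) -> str:
    """Lay the integer back down, right-justified in 4 characters."""
    if not 0 <= value <= 999:
        raise ValueError(f"out of range 0..999: {value}")
    return str(value).rjust(4)
```

Each check corresponds to one property or facet in the list, which is the 
point: the vocabulary matches the sentence that describes the field.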

btw: I still don't understand the regex you created for runway width.  Why not:
"\ {1,3}(?:[1-9]\d\d|[1-9]\d|\d)"

Professional programmers pretty much universally share the experience of 
finding regular expressions troublesome and difficult to use in almost any 
context. (See countless web articles like: 
http://www.ilian.io/the-road-to-hell-is-paved-with-regular-expressions/) They 
are a useful tool that is hard to use in practice.

DFDL is supposed to be much easier than regular expressions. I believe it can 
be easier if taught with the proper elaboration of concepts in the right order.

Well, the above is my rant about regular expressions.

As examples of things that cannot be expressed in your minimal DFDL:  anything 
with stored length or count information.  E.g., this data:

5
foo
bar
baz
quux
blah

That 5 is the count of how many strings follow. You can't unparse this without 
dfdl:outputValueCalc to lay down the 5 by counting that the array of elements 
has length 5 at unparse time, i.e.,

dfdl:outputValueCalc='{ fn:count(../theArray) }'
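
Outside DFDL, the same idea can be sketched in Python as follows. The function 
names are hypothetical, and the sketch assumes one item per line as in the data 
above. The count is not part of the logical data; the unparser derives it from 
the array, which is the role dfdl:outputValueCalc plays.

```python
def parse_counted(text: str) -> list:
    """Read a count line, then that many item lines."""
    lines = text.split("\n")
    count = int(lines[0])
    items = lines[1:1 + count]
    if len(items) != count:
        raise ValueError("fewer items than the stored count")
    return items


def unparse_counted(items: list) -> str:
    """Write the count first; it is computed from the array, not supplied."""
    return "\n".join([str(len(items)), *items]) + "\n"
```

If an application adds or removes an item between parse and unparse, the 
written count still comes out right without the application knowing the format.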

Strings with prefix lengths also cannot be expressed without 
lengthKind='explicit' (and outputValueCalc for unparsing).

For example this data is 2 fixed length 8-digit numbers followed by a 
variable-length string with a 2-digit stored length.

123456781234567807abcdefg

Another example would be an 80 character record, containing two fixed length 
8-digit numbers followed by a variable length string with 2-digit stored length 
like this:

123456781234567807abcdefg*******************************************************

In that case the 07 says what part of the available length is actively used for 
data, but the records are always the full 80 bytes/chars. This is very common.

You have to have something like dfdl:outputValueCalc='{ 
dfdl:valueLength(../theString) }' or you can't unparse this in general.
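
A minimal Python sketch of the 80-character record makes the dependency 
visible. The function names are hypothetical, the '*' fill character is taken 
from the example above, and the two numbers are assumed to be non-negative.

```python
RECORD_LEN = 80   # records are always the full 80 characters
FILL = "*"        # fill character, as in the example record


def parse_record(record: str) -> tuple:
    """Split a record into two numbers and the stored-length string."""
    if len(record) != RECORD_LEN:
        raise ValueError("record must be exactly 80 characters")
    num1 = int(record[0:8])
    num2 = int(record[8:16])
    length = int(record[16:18])            # the 2-digit stored length
    if 18 + length > RECORD_LEN:
        raise ValueError("stored length runs past the end of the record")
    the_string = record[18:18 + length]    # only this part is live data
    return num1, num2, the_string


def unparse_record(num1: int, num2: int, the_string: str) -> str:
    """Rebuild the record; the stored length is computed, not supplied."""
    if len(the_string) > RECORD_LEN - 18:
        raise ValueError("string does not fit in the record")
    length = len(the_string)               # the job of dfdl:valueLength above
    body = f"{num1:08d}{num2:08d}{length:02d}{the_string}"
    return body.ljust(RECORD_LEN, FILL)
```

Without the computed length in unparse_record, a caller that changed the string 
would also have to know to patch the two digits in front of it.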

In the narrow niche of cybersecurity data scanning, if you can restrict the 
processing to things that never change the length nor count of anything, then 
perhaps you don't need dfdl:outputValueCalc. But in general it is needed to 
avoid application code having to know the intricate details of a data format.

-mikeb

________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Monday, August 2, 2021 1:47 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Minimalist DFDL


Hi Folks,



The learning curve for DFDL is long and steep but I have found a tiny subset of 
DFDL that can be learned in less than a day and has (I believe) all the power 
of Full DFDL. I call the subset Minimalist DFDL. If you have experience with 
other parser generators such as lex/yacc, flex/bison, or ANTLR, then you will 
find their ideas directly apply to Minimalist DFDL. In other words, the 
learning curve drops way down.



The following discussion applies only to text data formats. I haven’t thought 
about a Minimalist DFDL for binary data formats.



The first key point in Minimalist DFDL is that there is only one datatype: 
string. There are no integers, dates, Booleans, decimals, etc. What that means 
is we can ignore all their properties.



The second key point is that every data item can be specified with a regular 
expression (regex).



Let’s jump right in and look at an example. The example illustrates (nearly) 
all the DFDL properties you need to know.

<xs:element name="Runway" dfdl:terminator="--">
    <xs:complexType>
        <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
            <xs:element name="RunwayWidth" type="xs:string"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])">
                <xs:annotation>
                    <xs:appinfo source="http://www.ogf.org/dfdl/">
                        <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
                    </xs:appinfo>
                </xs:annotation>
            </xs:element>
            <xs:element name="RunwayComposition" type="xs:string"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*">
                <xs:annotation>
                    <xs:appinfo source="http://www.ogf.org/dfdl/">
                        <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
                    </xs:appinfo>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>



In my example input documents contain data about a runway: the width of a 
runway followed by the composition of the runway. The value of width is 0-999 
and is right-justified in a 4-character field. 0-999 is specified by this regex:

[0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9]

Since we want the width value right justified (i.e., spaces precede the value), 
we need to prepend this to the regex:

[ ]*

yielding this regex:

[ ]*([0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9])

There is a bug in Daffodil which prevents that regex from working. However, I 
found that by rotating the parts of the regex that describe 0-999, with the 
part describing the highest value (99[0-9]) first and the part describing the 
smallest value ([0-9]) last, then Daffodil works fine. So the regex is this:

[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])
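
One possible explanation for the difference, offered as an assumption rather 
than a diagnosis of Daffodil: regex engines in the Perl/Java family try 
alternatives left to right and accept the first one that succeeds, not the 
longest. Python's re module follows the same rule, so the effect can be 
reproduced with a small sketch (the sample data line is made up):

```python
import re

SMALLEST_FIRST = re.compile(
    r"[ ]*([0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9])")
LARGEST_FIRST = re.compile(
    r"[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])")

data = "  55/ASPHALT --"

# Ordered alternation: [0-9] succeeds on the first digit, so the match stops early.
short = SMALLEST_FIRST.match(data).group(0)
# With the longer alternatives first, the whole 4-character field is consumed.
full = LARGEST_FIRST.match(data).group(0)

print(repr(short), len(short))   # '  5' 3
print(repr(full), len(full))     # '  55' 4
```

Under that assumption, the smallest-first regex yields a 3-character match for 
this field, which would then fail a length-4 assertion.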

An oddity of DFDL is that zero-length strings match regexes. We don’t want 
that. To prevent that, add a dfdl:assert containing an XPath expression which 
says the string length (remember, everything is a string) must be greater than 
0:

<xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) gt 0 }"/>
    </xs:appinfo>
</xs:annotation>

However, we can do better than that. We know the length must be 4, so let’s 
have the XPath expression state that the string length is 4:

<xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
    </xs:appinfo>
</xs:annotation>

Next, runway composition. It has an enumeration list of values and is 
left-justified in an 8-character field. The regex is trivial (be sure to put 
[ ]* at the end to left-justify the enumeration value):

(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*

Again, we add dfdl:assert to prevent zero-length strings from matching:

<xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
    </xs:appinfo>
</xs:annotation>
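
To try candidate field values against the composition regex and the length 
assertion outside Daffodil, a minimal Python sketch could look like this. The 
function name is hypothetical, and Python's re is only assumed to behave like 
the DFDL regex dialect for a pattern this simple.

```python
import re

COMPOSITION = re.compile(r"(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*")


def is_valid_composition(field: str) -> bool:
    """True when the field matches the enumeration regex and is 8 characters long."""
    return COMPOSITION.fullmatch(field) is not None and len(field) == 8
```

For example, is_valid_composition("CLAY    ") is true, while "CLAY" (too 
short) and "MUD     " (not in the enumeration) are both rejected.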

The data for runway width and runway composition are separated by a slash. If 
we think of input data as a sentence, then slash is punctuation. We need a way 
to express punctuation, and DFDL does a good job with that via the separator, 
initiator, and terminator properties.

“What about nil values? Don’t we need the DFDL properties associated with 
nillable?” No, we don’t. Nil values can be easily expressed in the regex. For 
example, suppose that when there is no runway width data then the field must 
contain a hyphen (with spaces to create a field with 4 characters). That is 
easily incorporated into the regex using a regex choice:

([ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9]))| [ ]*\-[ ]*

“What about escaping values? Don’t we need the DFDL properties associated with 
escapes?” No, we don’t. Escaping is something that was resolved long ago with 
regexes.

“Isn’t it purer to treat numbers as numbers, dates as dates, etc. rather than 
treating everything as strings?” Purer? What does that mean? It is a 
meaningless term. Does Minimalist DFDL get the job done (parsing and 
unparsing)? If so, that’s all that matters. I’ll take simplicity over purity 
any day.

“Does Minimalist DFDL work with both parsing and unparsing?” Yes. Beautifully.

“What about hidden groups, your example doesn’t have that; are you saying that 
hidden groups aren’t needed?” No, hidden groups are useful. Other things not 
shown but needed include occursCountKind="implicit", dfdl:choiceLengthKind, 
dfdl:choiceLength.

“What about the DFDL transformation properties such as inputValueCalc, aren’t 
they needed?” No. The Minimalist DFDL philosophy is that it is a parsing 
language, not a transformation language. If you need to do transformations, 
then do it after parsing (using something other than DFDL).

“Aren’t regexes hard to read, write, and maintain?” Well, they are, but I’ll 
make four points: (1) their complexity can be managed through various naming 
mechanisms (e.g., use the XML ENTITY mechanism to create named regexes), (2) 
regexes have been around a long time, are well-understood with lots of 
excellent regex processors, and are widely used throughout the programming 
community (i.e., there exists a large pool of people who understand regexes), 
(3) regexes provide razor-sharp precision (no fuzziness/ambiguity), and (4) 
despite their complexity they are a whole lot easier than having to deal with a 
ton of DFDL properties.

I welcome your comments. Are there text data formats that can be specified 
using Full DFDL that cannot be specified using Minimalist DFDL? Concrete 
examples would be appreciated.

/Roger
