Thank you, Mike, for the detailed and excellent explanation.

But I disagree.

I think DFDL got it wrong with regard to regexes.


  *   In DFDL, the lengthKind 'pattern' was added as a *hack* to cope with 
things we couldn't come up with any better way to handle. It is intended to be 
a last resort

That is sad. Regexes are fundamental in every other parsing tool, both in 
practice and in theory.

I am creating 350 DFDL schemas, one for each of the 350 USMTF messages. Each 
USMTF message already has an existing XML Schema, so I am simply adding the 
appropriate DFDL properties to the schemas. The XML Schemas specify each field 
via a regex. So, the obvious way to implement the DFDL schemas is to use 
dfdl:lengthKind="pattern" and dfdl:lengthPattern="regex", where "regex" is the 
regex already provided by the XML Schema.


  *   To get a good format description in DFDL, dfdl:lengthKind pattern must be 
used carefully and minimally.

I am auto-generating the 350 DFDL schemas using a tool I wrote. I am using 
dfdl:lengthKind="pattern" and dfdl:lengthPattern="regex" to the *maximal* 
extent.

I recommend changing the DFDL specification. Regexes should be a first-class 
citizen, not a "hack."

/Roger


From: Mike Beckerle <mbecke...@apache.org>
Sent: Sunday, May 1, 2022 5:44 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

re: "Doesn't that lengthPattern mean, 'The allowable values for this element 
are foo, bar, or dash'?"

No. The length of the match of the lengthPattern isolates the content region 
for this element in the data grammar. No match means length 0.

I.e., the dfdl:lengthPattern property is about determining the length of the 
representation of the element. It is only about the length.
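
As a rough illustration of "only about the length" (a sketch, not how Daffodil 
is implemented): the hypothetical helper content_length below uses Python's re 
module as a stand-in for the DFDL regex engine, whose syntax may differ in 
details. It returns how many characters the pattern isolates at the current 
position, with no match giving length 0.

```python
import re


def content_length(length_pattern: str, data: str, pos: int = 0) -> int:
    """Return the length of the region the pattern isolates, starting at pos.

    Hypothetical helper: the regex is matched at the current position only,
    and a failed match is treated as length 0 rather than as an error.
    """
    match = re.compile(length_pattern).match(data, pos)
    return match.end() - pos if match else 0


# "foo|bar" isolates 3 characters of "foo//" but nothing of "-//",
# which is why the nil representation "-" is never seen by the parser.
print(content_length("foo|bar", "foo//"))    # 3
print(content_length("foo|bar", "-//"))      # 0
print(content_length("foo|bar|-", "-//"))    # 1
```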

The dfdl:lengthPattern is NOT, in general, a statement about the value. 
Coincidentally, if the type is string, then the lengthPattern regex may 
overlap with the logical values or literal nil values that the strings must 
contain. But the best way to think about lengthPattern is to ignore the value 
itself and use lookahead/lookbehind regex features to find out what must 
terminate the data, i.e., what must appear after it. That's the primary 
intended use case for lengthKind 'pattern': not to recognize valid allowed 
data, but to scan past it for things that indicate where it ends.

Determining length is a key concept in DFDL. You can do pretty much nothing 
until you determine length. You haven't isolated what data you are even talking 
about until length determination is over. Then you have to determine the 
difference between content and value regions within the data (typically due to 
padding), and then whether it is the nil, empty, or normal representation. 
Then, if it is the normal representation, you can start talking about what 
regex the value must match if it is a string (via regular XSD pattern facets, 
which are about the string value, now isolated from the data stream), what 
calendar-pattern it must match if it is a date/time, what boolean value it 
converts to by way of the textBooleanXYZZY properties, etc.

Determining length is also key to understanding the difference between "well 
formed" data and "valid" data. A string is well formed if it can be isolated 
properly from the data stream, i.e., we can determine which characters/bytes of 
the data stream *should* be the data, and then talk about whether that data is 
invalid. If we can't even figure out which characters/bytes of the data stream 
should be considered to be the element in question, that's what we mean by 
"malformed" data.
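
To make the distinction concrete, here is a minimal sketch in Python, under 
stated assumptions: the field is isolated purely by scanning for the "//" 
terminator, and the allowed values foo and bar are checked afterwards, the way 
an XSD pattern facet would check them. The names isolate, classify, and 
VALUE_FACET are invented for this illustration and are not DFDL or Daffodil 
APIs.

```python
import re
from typing import Optional

TERMINATOR = "//"
# Stand-in for an XSD pattern facet: a statement about the value only.
VALUE_FACET = re.compile(r"foo|bar")


def isolate(data: str) -> Optional[str]:
    """Step 1: find which characters belong to the field (length only)."""
    end = data.find(TERMINATOR)
    return None if end < 0 else data[:end]


def classify(data: str) -> str:
    """Step 2: only once the field is isolated can validity be discussed."""
    field = isolate(data)
    if field is None:
        return "malformed"                # cannot even say what the field is
    if VALUE_FACET.fullmatch(field):
        return "valid"
    return "well formed but invalid"      # isolated fine, value not allowed


print(classify("foo//"))   # valid
print(classify("baz//"))   # well formed but invalid
print(classify("baz"))     # malformed
```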

To get a good format description in DFDL, dfdl:lengthKind pattern must be used 
carefully and minimally.

A format description language that handles textual data format description as a 
BNF grammar with interspersed regular expressions is a potentially useful 
concept.

DFDL is *not* that language.

In DFDL, the lengthKind 'pattern' was added as a *hack* to cope with things we 
couldn't come up with any better way to handle. It is intended to be a last 
resort for formats that are otherwise impossible to model. It is, for example, 
meant to handle the situation in USMTF where "//" is a terminator, except that 
since the internet came around we must now allow content like 
"http://some.domain.foo/url/syntax", which contains a "//". Hence, lengthKind 
pattern can be used to end a field at a "//" that is not preceded by ":", 
using the look-ahead and negative look-behind regex features.
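
A sketch of that kind of regex, using Python's re module as a stand-in (the 
regex dialect used for dfdl:lengthPattern may differ in details, and the 
pattern and the helper name isolate_field are illustrative assumptions, not 
taken from any USMTF schema):

```python
import re

# Lazily scan forward to the next "//" that is NOT preceded by ":".
# (?=...) is a lookahead and (?<!:) a negative lookbehind; neither consumes
# data, so the "//" terminator itself stays outside the isolated field.
FIELD_LENGTH_PATTERN = re.compile(r".*?(?=(?<!:)//)")


def isolate_field(data: str) -> str:
    """Return the field content isolated by the pattern ('' if no match)."""
    match = FIELD_LENGTH_PATTERN.match(data)
    return match.group(0) if match else ""


data = "see http://some.domain.foo/url/syntax//NEXT//"
print(isolate_field(data))  # see http://some.domain.foo/url/syntax
```

Note that the pattern says nothing about which values are allowed in the 
field; it only looks for where the field ends.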

That's what lengthKind pattern is for. Not for recognizing allowed string 
values. XSD pattern facets are for recognizing allowed string values.

-mikeb

On Wed, Apr 27, 2022 at 6:10 PM Roger L Costello <coste...@mitre.org> wrote:
Hi Steve,

> dfdl:lengthPattern="foo|bar|-"

That's really interesting. In my data format, the dash is to be used only to 
indicate there is no data available. Doesn't that lengthPattern mean, "The 
allowable values for this element are foo, bar, or dash"? If I use that 
lengthPattern, is there any reason to use nillable="true" and dfdl:nilValue="-"?

/Roger

-----Original Message-----
From: Steve Lawrence <slawre...@apache.org>
Sent: Wednesday, April 27, 2022 3:04 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

Your length pattern must include something that matches the nil content
as well, otherwise Daffodil doesn't actually know how long your nil
content is. So your pattern needs to look something like this:

   dfdl:lengthPattern="foo|bar|-"

Additionally, because the "A" element could be nilled, you also need to
update your assertion. This is because when an element is nilled it
doesn't actually have a value, so accessing the value to compare it to
the empty string will cause an SDE. Instead, your assertion wants to be
something like this:

   <dfdl:assert test="{ fn:nilled(.) or . ne '' }"/>

This asserts that either your element is nilled or its value is not the
empty string.

- Steve

On 4/27/22 2:11 PM, Roger L Costello wrote:
> Hi Folks,
>
> My input consists of one field terminated by //
>
> The value of the field is either foo or bar.
>
> Here is a sample input:
>
> foo//
>
> My DFDL schema works fine with that input.
>
> The field is nillable and the nilValue is a hyphen. Here is a valid input:
>
> -//
>
> My DFDL schema fails with that input.
>
> I specify the field using dfdl:lengthKind="pattern" and 
> dfdl:lengthPattern="foo|bar"
>
> Below is my DFDL schema. Am I doing something wrong or is this a bug in 
> Daffodil? If so, is there a workaround?  /Roger
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
> xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" elementFormDefault="qualified">
>      <xs:annotation>
>          <xs:appinfo source="http://www.ogf.org/dfdl/">
>              <dfdl:format
>                  alignment="1"
>                  alignmentUnits="bytes"
>                  emptyValueDelimiterPolicy="none"
>                  encoding="ASCII"
>                  encodingErrorPolicy="replace"
>                  escapeSchemeRef=""
>                  fillByte="%SP;"
>                  floating="no"
>                  ignoreCase = "yes"
>                  initiatedContent="no"
>                  initiator = ""
>                  leadingSkip="0"
>                  lengthKind = "delimited"
>                  lengthUnits="characters"
>                  nilKind="literalValue"
>                  nilValue="-"
>                  nilValueDelimiterPolicy="none"
>                  occursCountKind="implicit"
>                  outputNewLine="%CR;%LF;"
>                  representation="text"
>                  separator=""
>                  separatorSuppressionPolicy="anyEmpty"
>                  sequenceKind="ordered"
>                  textBidi="no"
>                  textPadKind="none"
>                  textTrimKind="none"
>                  trailingSkip="0"
>                  truncateSpecifiedLengthString="no"
>                  terminator = ""
>                  textNumberRep="standard"
>                  textStandardBase="10"
>                  textStandardZeroRep="0"
>                  textNumberRounding="pattern"
>                  textStandardExponentRep="E"
>                  textNumberCheckPolicy="strict"
>              />
>          </xs:appinfo>
>      </xs:annotation>
>
>      <xs:element name="Test" dfdl:terminator="//">
>          <xs:complexType>
>              <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
>                  <xs:element name="A" type="non-zero-length-string" 
> nillable="true"
>                                        dfdl:lengthPattern="foo|bar" 
> dfdl:nilValue="-" />
>              </xs:sequence>
>          </xs:complexType>
>      </xs:element>
>
>      <xs:simpleType name="non-zero-length-string" dfdl:lengthKind="pattern">
>          <xs:annotation>
>              <xs:appinfo source="http://www.ogf.org/dfdl/">
>                  <dfdl:assert test="{ . ne '' }"/>
>              </xs:appinfo>
>          </xs:annotation>
>          <xs:restriction base="xs:string"/>
>      </xs:simpleType>
>
> </xs:schema>
