Alternate DFDL lengthKind enum values, such as
dfdl:lengthKind="valuePattern", have been suggested to support the behavior
you have described.

However, I was unable to find a record of that suggestion, so I created a
JIRA ticket to track it.

https://issues.apache.org/jira/browse/DAFFODIL-2692



On Mon, May 2, 2022 at 11:20 AM Roger L Costello <coste...@mitre.org> wrote:

> Thank you Mike for the detailed and excellent explanation.
>
>
>
> But I disagree.
>
>
>
> I think DFDL got it wrong with regard to regexes.
>
>
>
>    - In DFDL, the lengthKind 'pattern' was added as a *hack* to cope with
>    things we couldn't come up with any better way to handle. It is intended to
>    be a last resort
>
>
>
> That is sad. Regexes are fundamental in every other parsing tool, both at
> a practical level and at a theory level.
>
>
>
> I am creating 350 DFDL schemas, one for each of the 350 USMTF messages.
> Each USMTF message already has an existing XML Schema, so I am simply
> adding the appropriate DFDL properties to the schemas. The XML Schemas
> specify each field via a regex. So, the obvious way to implement the DFDL
> schemas is to use dfdl:lengthKind="pattern" and dfdl:lengthPattern="regex"
> where "regex" is the regex already provided by the XML Schema.
>
>
>
>    - To get a good format description in DFDL, dfdl:lengthKind pattern
>    must be used carefully and minimally.
>
>
>
> I am auto-generating the 350 DFDL schemas using a tool I wrote. I am using
> dfdl:lengthKind="pattern" and dfdl:lengthPattern="regex" to the *maximal*
> extent.
>
>
>
> I recommend changing the DFDL specification. Regexes should be a first
> class citizen, not a “hack.”
>
>
>
> /Roger
>
>
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Sunday, May 1, 2022 5:44 PM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Bug in Daffodil?
>
>
>
> re: "Doesn't that lengthPattern mean, "The allowable values for this
> element are foo, bar, or dash?"
>
>
>
> No. The length of the match of the lengthPattern isolates the content
> region for this element in the data grammar. No match means length 0.
>
>
>
> I.e., the dfdl:lengthPattern property is about determining the length of
> the representation of the element. It is only about the length.
>
>
>
> The dfdl:lengthPattern is NOT, in general, a statement about the value.
> Coincidentally, if the type is string, then there may be overlap in the
> lengthPattern regex between string values and logical values or literal nil
> values that the strings must contain. But the best way to think about
> lengthPattern is to ignore the value itself and use lookahead/lookbehind
> regex features to find out what must terminate the data, i.e., what must
> appear after it. That's the primary intended use case for lengthKind
> 'pattern': not to recognize valid allowed data, but to scan past it for
> things that indicate where it ends.
>
>
>
> Determining length is a key concept in DFDL. You can do pretty much
> nothing until you determine length. You haven't isolated what data you are
> even talking about until length determination is over. Then you have to
> determine the difference between content and value regions within the data
> (typically due to padding) and then whether it is the nil, empty, or normal
> representation. Then, if it is normal representation, you can start talking
> about what regex the value must match if it is a string (via regular XSD
> pattern facets, which are about the string value, now isolated from the
> data stream), what calendar pattern it must match if it is a date/time,
> what boolean value it converts to by way of the textBooleanXYZZY
> properties, etc.
>
>
>
> Determining length is also key to understanding the difference between
> "well formed" data and "valid" data. A string is well formed if it can be
> isolated properly from the data stream, i.e., we can determine which
> characters/bytes of the data stream *should* be the data, and then talk
> about how that data is invalid. If we can't even figure out which
> characters/bytes of the data stream should be considered to be the element
> in question, that's what we mean by "malformed" data.
>
>
>
> To get a good format description in DFDL, dfdl:lengthKind pattern must be
> used carefully and minimally.
>
>
>
> A format description language that handles textual data format description
> as a BNF grammar with interspersed regular expressions is a potentially
> useful concept.
>
>
>
> DFDL is *not* that language.
>
>
>
> In DFDL, the lengthKind 'pattern' was added as a *hack* to cope with
> things we couldn't come up with any better way to handle. It is intended to
> be a last resort for formats that are otherwise impossible to model. It is,
> for example, to handle the situation in USMTF where "//" is a terminator,
> except since the internet came around we now must allow content like
> "http://some.domain.foo/url/syntax", which contains a "//". Hence,
> lengthKind pattern can be used to end a field with a "//" that is not
> preceded by ":", using the look-ahead and negative look-behind regex
> features.
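>
> A minimal sketch of that "//" case, for illustration only: the element
> name FreeText is hypothetical and not taken from any real USMTF schema,
> and the regex is just one way to say "scan up to a // that is not preceded
> by a colon". Note that "<" has to be written as &lt; inside an XML
> attribute value.
>
> ```xml
> <!-- Hypothetical element; the name and the exact regex are assumptions. -->
> <!-- .*? scans lazily; the look-ahead stops before a "//" whose preceding -->
> <!-- character is not ":" (the negative look-behind).                     -->
> <xs:element name="FreeText" type="xs:string"
>             dfdl:lengthKind="pattern"
>             dfdl:lengthPattern=".*?(?=(?&lt;!:)//)"
>             dfdl:terminator="//"/>
> ```
>
> With data such as "http://some.domain.foo/url/syntax//", a regex of this
> shape should skip the "//" that follows "http:" and stop before the final
> "//", which is then consumed as the terminator. The pattern says nothing
> about which values are allowed; it only finds where the field ends.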
>
>
>
> That's what lengthKind pattern is for. Not for recognizing allowed
> string values. XSD pattern facets are for recognizing allowed string
> values.
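>
> For contrast, a sketch of how the foo/bar field from this thread might be
> modeled without lengthKind 'pattern' at all. It is untested and assumes
> the rest of the posted schema stays as is: the default lengthKind
> 'delimited' and the "//" terminator on the enclosing Test element isolate
> the content, dfdl:nilValue handles "-", and an XSD pattern facet describes
> the allowed values. It also assumes the facet is only checked when
> validation is enabled or when an assert calls dfdl:checkConstraints(.).
>
> ```xml
> <!-- Sketch only: relies on the default lengthKind="delimited" and the -->
> <!-- enclosing "//" terminator from the schema posted in this thread.  -->
> <xs:element name="A" nillable="true" dfdl:nilValue="-">
>   <xs:simpleType>
>     <xs:restriction base="xs:string">
>       <!-- Describes the value, not its length. -->
>       <xs:pattern value="foo|bar"/>
>     </xs:restriction>
>   </xs:simpleType>
> </xs:element>
> ```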
>
>
>
> -mikeb
>
> On Wed, Apr 27, 2022 at 6:10 PM Roger L Costello <coste...@mitre.org>
> wrote:
>
> Hi Steve,
>
> > dfdl:lengthPattern="foo|bar|-"
>
> That's really interesting. In my data format, the dash is to be used only
> to indicate there is no data available. Doesn't that lengthPattern mean,
> "The allowable values for this element are foo, bar, or dash"? If I use
> that lengthPattern, is there any reason to use nillable="true" and
> dfdl:nilValue="-"?
>
> /Roger
>
> -----Original Message-----
> From: Steve Lawrence <slawre...@apache.org>
> Sent: Wednesday, April 27, 2022 3:04 PM
> To: users@daffodil.apache.org
> Subject: [EXT] Re: Bug in Daffodil?
>
> Your length pattern must include something that matches the nil content
> as well, otherwise Daffodil doesn't actually know how long your nil
> content is. So your pattern needs to look something like this:
>
>    dfdl:lengthPattern="foo|bar|-"
>
> Additionally, because the "A" element could be nilled, you also need to
> update your assertion. This is because when an element is nilled it
> doesn't actually have a value, so accessing the value to compare it to
> the empty string will cause an SDE. Instead, your assertion wants to be
> something like this:
>
>    <dfdl:assert test="{ fn:nilled(.) or . ne '' }"/>
>
> This asserts that either your element is nilled or its value is not the
> empty string.
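>
> Putting the two changes together against the schema quoted below, the
> affected declarations might end up looking roughly like this. This is a
> sketch assembled from the fragments above, not something that was run:
>
> ```xml
> <xs:element name="A" type="non-zero-length-string" nillable="true"
>             dfdl:lengthPattern="foo|bar|-" dfdl:nilValue="-"/>
>
> <xs:simpleType name="non-zero-length-string" dfdl:lengthKind="pattern">
>   <xs:annotation>
>     <xs:appinfo source="http://www.ogf.org/dfdl/">
>       <!-- Passes when the element is nil or its value is non-empty. -->
>       <dfdl:assert test="{ fn:nilled(.) or . ne '' }"/>
>     </xs:appinfo>
>   </xs:annotation>
>   <xs:restriction base="xs:string"/>
> </xs:simpleType>
> ```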
>
> - Steve
>
> On 4/27/22 2:11 PM, Roger L Costello wrote:
> > Hi Folks,
> >
> > My input consists of one field terminated by //
> >
> > The value of the field is either foo or bar.
> >
> > Here is a sample input:
> >
> > foo//
> >
> > My DFDL schema works fine with that input.
> >
> > The field is nillable and the nilValue is a hyphen. Here is a valid
> > input:
> >
> > -//
> >
> > My DFDL schema fails with that input.
> >
> > I specify the field using dfdl:lengthKind="pattern" and
> > dfdl:lengthPattern="foo|bar"
> >
> > Below is my DFDL schema. Am I doing something wrong or is this a bug in
> > Daffodil? If so, is there a workaround?  /Roger
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
> >            xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
> >            elementFormDefault="qualified">
> >      <xs:annotation>
> >          <xs:appinfo source="http://www.ogf.org/dfdl/">
> >              <dfdl:format
> >                  alignment="1"
> >                  alignmentUnits="bytes"
> >                  emptyValueDelimiterPolicy="none"
> >                  encoding="ASCII"
> >                  encodingErrorPolicy="replace"
> >                  escapeSchemeRef=""
> >                  fillByte="%SP;"
> >                  floating="no"
> >                  ignoreCase = "yes"
> >                  initiatedContent="no"
> >                  initiator = ""
> >                  leadingSkip="0"
> >                  lengthKind = "delimited"
> >                  lengthUnits="characters"
> >                  nilKind="literalValue"
> >                  nilValue="-"
> >                  nilValueDelimiterPolicy="none"
> >                  occursCountKind="implicit"
> >                  outputNewLine="%CR;%LF;"
> >                  representation="text"
> >                  separator=""
> >                  separatorSuppressionPolicy="anyEmpty"
> >                  sequenceKind="ordered"
> >                  textBidi="no"
> >                  textPadKind="none"
> >                  textTrimKind="none"
> >                  trailingSkip="0"
> >                  truncateSpecifiedLengthString="no"
> >                  terminator = ""
> >                  textNumberRep="standard"
> >                  textStandardBase="10"
> >                  textStandardZeroRep="0"
> >                  textNumberRounding="pattern"
> >                  textStandardExponentRep="E"
> >                  textNumberCheckPolicy="strict"
> >              />
> >          </xs:appinfo>
> >      </xs:annotation>
> >
> >      <xs:element name="Test" dfdl:terminator="//">
> >          <xs:complexType>
> >              <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
> >                  <xs:element name="A" type="non-zero-length-string" nillable="true"
> >                                        dfdl:lengthPattern="foo|bar" dfdl:nilValue="-" />
> >              </xs:sequence>
> >          </xs:complexType>
> >      </xs:element>
> >
> >      <xs:simpleType name="non-zero-length-string" dfdl:lengthKind="pattern">
> >          <xs:annotation>
> >              <xs:appinfo source="http://www.ogf.org/dfdl/">
> >                  <dfdl:assert test="{ . ne '' }"/>
> >              </xs:appinfo>
> >          </xs:annotation>
> >          <xs:restriction base="xs:string"/>
> >      </xs:simpleType>
> >
> > </xs:schema>
>
>
