Alternative DFDL lengthKind enum values have been suggested, such as dfdl:lengthKind="valuePattern", intended to support the behavior you have described.
However, I was unable to find a record of this so I created a JIRA ticket to track this.

https://issues.apache.org/jira/browse/DAFFODIL-2692

On Mon, May 2, 2022 at 11:20 AM Roger L Costello <coste...@mitre.org> wrote:

> Thank you Mike for the detailed and excellent explanation.
>
> But I disagree.
>
> I think DFDL got it wrong with regard to regexes.
>
> - In DFDL, the lengthKind 'pattern' was added as a *hack* to cope with
>   things we couldn't come up with any better way to handle. It is intended to
>   be a last resort
>
> That is sad. Regexes are fundamental in every other parsing tool, both at
> a practical level and at a theory level.
>
> I am creating 350 DFDL schemas, one for each of the 350 USMTF messages.
> Each USMTF message already has an existing XML Schema, so I am simply
> adding the appropriate DFDL properties to the schemas. The XML Schemas
> specify each field via a regex. So, the obvious way to implement the DFDL
> schemas is to use dfdl:lengthKind="pattern" and dfdl:lengthPattern="regex"
> where "regex" is the regex already provided by the XML Schema.
>
> - To get a good format description in DFDL, dfdl:lengthKind pattern
>   must be used carefully and minimally.
>
> I am auto-generating the 350 DFDL schema using a tool I wrote. I am using
> dfdl:lengthKind="pattern" and dfdl:lengthPattern="regex" to the **maximal**
> extent.
>
> I recommend changing the DFDL specification. Regexes should be a first
> class citizen, not a "hack."
>
> /Roger
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Sunday, May 1, 2022 5:44 PM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Bug in Daffodil?
>
> re: "Doesn't that lengthPattern mean, "The allowable values for this
> element are foo, bar, or dash?"
>
> No. The length of the match of the lengthPattern isolates the content
> region for this element in the data grammar. No match means length 0.
>
> I.e., the dfdl:lengthPattern property is about determining the length of
> the representation of the element. It is only about the length.
>
> The dfdl:lengthPattern is NOT, in general, a statement about the value.
> Coincidently, if the type is string, then there may be overlap in the
> lengthPattern regex between string values and logical values or literal nil
> values that the strings must contain. But the best way to think about
> lengthPattern is to ignore the value itself and use lookahead/lookbehind
> regex features to find out what must terminate the data, i.e., what must
> appear after it. That's the primary intended use case for lengthKind
> 'pattern' not to recognize valid allowed data, but to scan past it for
> things that indicate where it ends.
>
> Determining length is a key concept in DFDL. You can do nothing pretty
> much until you determine length. You haven't isolated what data you are
> even talking about until length determination is over. Then you have to
> determine the difference between content and value regions within the data
> (due to padding typically) and then whether it is the nil, empty, or normal
> representation. Then, if it is normal representation, you can start talking
> about what regex the value must match if it is a string (via regular XSD
> pattern facet, which are about the string value - now isolated from the
> data stream), what calendar-pattern it must match if it is a date/time,
> what boolean value it converts to by way of the textBooleanXYZZY
> properties, etc.
>
> Determining length is a key concept to understand the difference between
> "well formed" data and "valid" data. A string is well formed if it can be
> isolated properly from the data stream i.e., we can determine which
> characters/bytes of the data stream *should* be the data and talk about how
> that data is invalid.
> If we can't even figure out which characters/bytes of
> the data stream should even be considered to be the element in question,
> that's what we mean by "malformed" data.
>
> To get a good format description in DFDL, dfdl:lengthKind pattern must be
> used carefully and minimally.
>
> A format description language that handles textual data format description
> as a BNF grammar with interspersed regular expressions is a potentially
> useful concept.
>
> DFDL is *not* that language.
>
> In DFDL, the lengthKind 'pattern' was added as a *hack* to cope with
> things we couldn't come up with any better way to handle. It is intended to
> be a last resort for formats that are otherwise impossible to model. It is,
> for example, to handle the situation in USMTF where "//" is a terminator,
> except since the internet came around we now must allow content like
> "http://some.domain.foo/url/syntax", which contains a "//" hence,
> lengthKind pattern can be used to end a field with a "//" that is not
> preceded by ":", using the look-ahead and negative look-behind regex
> features.
>
> That's what lengthKind pattern is for. Not for recognizing allowed
> string values. XSD pattern facets are for recognizing allowed string
> values.
>
> -mikeb
>
> On Wed, Apr 27, 2022 at 6:10 PM Roger L Costello <coste...@mitre.org>
> wrote:
>
> Hi Steve,
>
> > dfdl:lengthPattern="foo|bar|-"
>
> That's really interesting. In my data format, the dash is to be used only
> to indicate there is no data available. Doesn't that lengthPattern mean,
> "The allowable values for this element are foo, bar, or dash"? If I use
> that lengthPattern, is there any reason to use nillable="true" and
> dfdl:nilValue="-"?
>
> /Roger
>
> -----Original Message-----
> From: Steve Lawrence <slawre...@apache.org>
> Sent: Wednesday, April 27, 2022 3:04 PM
> To: users@daffodil.apache.org
> Subject: [EXT] Re: Bug in Daffodil?
>
> Your pattern length must include something that matches the nil content
> as well, otherwise Daffodil doesn't actually know how long your nil
> content is. So your pattern needs to look something like this:
>
>    dfdl:lengthPattern="foo|bar|-"
>
> Additionally, because the "A" element could be nilled, you also need to
> update your assertion. This is because when an element is nilled it
> doesn't actually have a value, so accessing the value to compare it to
> the empty string will cause an SDE. Instead, your assertion wants to be
> something like this:
>
>    <dfdl:assert test="{ fn:nilled(.) or . ne '' }"/>
>
> This asserts that either your element is nilled or its value is not the
> empty string.
>
> - Steve
>
> On 4/27/22 2:11 PM, Roger L Costello wrote:
> > Hi Folks,
> >
> > My input consists of one field terminated by //
> >
> > The value of the field is either foo or bar.
> >
> > Here is a sample input:
> >
> > foo//
> >
> > My DFDL schema works fine with that input.
> >
> > The field is nillable and the nilValue is a hyphen. Here is a valid input:
> >
> > -//
> >
> > My DFDL schema fails with that input.
> >
> > I specify the field using dfdl:lengthKind="pattern" and dfdl:lengthPattern="foo|bar"
> >
> > Below is my DFDL schema. Am I doing something wrong or is this a bug in
> > Daffodil? If so, is there a workaround?
> >
> > /Roger
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" elementFormDefault="qualified">
> >   <xs:annotation>
> >     <xs:appinfo source="http://www.ogf.org/dfdl/">
> >       <dfdl:format
> >         alignment="1"
> >         alignmentUnits="bytes"
> >         emptyValueDelimiterPolicy="none"
> >         encoding="ASCII"
> >         encodingErrorPolicy="replace"
> >         escapeSchemeRef=""
> >         fillByte="%SP;"
> >         floating="no"
> >         ignoreCase = "yes"
> >         initiatedContent="no"
> >         initiator = ""
> >         leadingSkip="0"
> >         lengthKind = "delimited"
> >         lengthUnits="characters"
> >         nilKind="literalValue"
> >         nilValue="-"
> >         nilValueDelimiterPolicy="none"
> >         occursCountKind="implicit"
> >         outputNewLine="%CR;%LF;"
> >         representation="text"
> >         separator=""
> >         separatorSuppressionPolicy="anyEmpty"
> >         sequenceKind="ordered"
> >         textBidi="no"
> >         textPadKind="none"
> >         textTrimKind="none"
> >         trailingSkip="0"
> >         truncateSpecifiedLengthString="no"
> >         terminator = ""
> >         textNumberRep="standard"
> >         textStandardBase="10"
> >         textStandardZeroRep="0"
> >         textNumberRounding="pattern"
> >         textStandardExponentRep="E"
> >         textNumberCheckPolicy="strict"
> >       />
> >     </xs:appinfo>
> >   </xs:annotation>
> >
> >   <xs:element name="Test" dfdl:terminator="//">
> >     <xs:complexType>
> >       <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
> >         <xs:element name="A" type="non-zero-length-string" nillable="true"
> >                     dfdl:lengthPattern="foo|bar" dfdl:nilValue="-" />
> >       </xs:sequence>
> >     </xs:complexType>
> >   </xs:element>
> >
> >   <xs:simpleType name="non-zero-length-string" dfdl:lengthKind="pattern">
> >     <xs:annotation>
> >       <xs:appinfo source="http://www.ogf.org/dfdl/">
> >         <dfdl:assert test="{ . ne '' }"/>
> >       </xs:appinfo>
> >     </xs:annotation>
> >     <xs:restriction base="xs:string"/>
> >   </xs:simpleType>
> >
> > </xs:schema>
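
For reference, here is a minimal sketch of how the declarations from the original schema might look with the two changes Steve suggested (the nil literal added to dfdl:lengthPattern, and fn:nilled(.) added to the assertion). It is a fragment, not a full schema: the xs:schema wrapper and the dfdl:format block are assumed unchanged. It also assumes the fn prefix is bound on xs:schema (xmlns:fn="http://www.w3.org/2005/xpath-functions"), which the original schema does not declare. The last element, UrlField, is a hypothetical name used only to illustrate the terminator-scanning use of lengthKind 'pattern' that Mike describes; the regex is one possible way to write it and has not been run through Daffodil.

```xml
<!-- Fragment only. Assumes xmlns:fn="http://www.w3.org/2005/xpath-functions"
     is declared on xs:schema alongside the existing xs and dfdl prefixes. -->

<xs:element name="Test" dfdl:terminator="//">
  <xs:complexType>
    <xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
      <!-- The pattern now also matches the nil literal "-", so the length
           of the nil representation can be determined. -->
      <xs:element name="A" type="non-zero-length-string" nillable="true"
                  dfdl:lengthPattern="foo|bar|-" dfdl:nilValue="-" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:simpleType name="non-zero-length-string" dfdl:lengthKind="pattern">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- A nilled element has no value, so check fn:nilled(.) before
           comparing the value to the empty string. -->
      <dfdl:assert test="{ fn:nilled(.) or . ne '' }"/>
    </xs:appinfo>
  </xs:annotation>
  <xs:restriction base="xs:string"/>
</xs:simpleType>

<!-- Hypothetical element (not in the original schema): the pattern does not
     describe allowed values. It scans lazily up to the first "//" that is
     not preceded by ":", so "http://host/path//" yields "http://host/path".
     The "<" of the look-behind is escaped as &lt; inside the XML attribute. -->
<xs:element name="UrlField" type="xs:string"
            dfdl:lengthKind="pattern"
            dfdl:lengthPattern=".*?(?=(?&lt;!:)//)"
            dfdl:terminator="//" />
```

With the first two declarations, the inputs "foo//" and "-//" from the original message should both have their length determined, with "-" recognized as the nil representation. Restricting the allowed values to foo and bar would then be the job of an XSD pattern facet on the simple type rather than of dfdl:lengthPattern.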