Thanks Mike. That is contrary to the way that regexes work in XSD. For example, here I list the regex choice alternatives shortest to longest:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="test">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="GET|GETVALUE"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>

This XML document validates against the XSD:

<test>GETVALUE</test>

So does this one:

<test>GET</test>

Here I changed the XSD to list the regex choice alternatives longest to shortest:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="test">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="GETVALUE|GET"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>

Again, both XML documents validate against the schema.

JSON and JSON Schema behave the same way. So do Flex and Bison.

I find the behavior of Daffodil to be quite different from any regex engine that I've ever used.

/Roger

From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
Sent: Wednesday, April 6, 2022 2:05 PM
To: Roger L Costello <coste...@mitre.org>; users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil

On that page, paragraph 4 under the heading "Remember That the Regex Engine is Eager":

"I already explained that the regex engine is eager <https://www.regular-expressions.info/engine.html>. It stops searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters."

It then goes on to explain how, for "Get|GetValue", the order matters, and that the 2nd alternative is only tried if the first one fails.

________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Wednesday, April 6, 2022 1:54 PM
To: users@daffodil.apache.org; Beckerle, Mike <mbecke...@owlcyberdefense.com>
Subject: Re: Bug in Daffodil

Hi Mike,

I read the web page you referenced.
I don't see where it says that the order of regex choice alternatives matters. Would you quote the sentence that says that, please?

/Roger

From: Mike Beckerle <mbecke...@apache.org>
Sent: Wednesday, April 6, 2022 1:48 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil

This is standard regex behavior. Order of the regex choice alternatives matters very much. Authors of regexes must order the alternatives so that the longest matches are attempted first.

See: https://www.regular-expressions.info/alternation.html

This is one of the reasons DFDL delimiters don't let you just write a regex. For delimiter matching, DFDL insists on the longest match, whereas standard regex behavior is simply "always eager".

On Wed, Apr 6, 2022 at 1:34 PM Roger L Costello <coste...@mitre.org> wrote:

With this input:

GENTEXT/FOO/TAS//

The following DFDL generates the dreaded "Left over data" error:

<xs:element name="GeneralTextInfo" minOccurs="0"
            dfdl:initiator="GENTEXT" dfdl:terminator="//">
  <xs:complexType>
    <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
      <xs:element name="TextIndicator" minOccurs="0" nillable="true"
                  type="non-zero-length-string"
                  dfdl:lengthPattern="[A-Z ]+"/>
      <xs:element name="FreeText" minOccurs="0" nillable="true"
                  type="non-zero-length-string"
                  dfdl:lengthPattern="[A-Z]|([A-Z][/A-Z]*[A-Z])"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

If I reverse the regex for FreeText:

<xs:element name="GeneralTextInfo" minOccurs="0"
            dfdl:initiator="GENTEXT" dfdl:terminator="//">
  <xs:complexType>
    <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
      <xs:element name="TextIndicator" minOccurs="0" nillable="true"
                  type="non-zero-length-string"
                  dfdl:lengthPattern="[A-Z ]+"/>
      <xs:element name="FreeText" minOccurs="0" nillable="true"
                  type="non-zero-length-string"
                  dfdl:lengthPattern="([A-Z][/A-Z]*[A-Z])|[A-Z]"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Then the error goes away. This seems like a bug in Daffodil. The order in which a regex OR clause is expressed should not matter.

/Roger
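
The two behaviors discussed in this thread can be reproduced outside Daffodil. What follows is a minimal Python sketch, not Daffodil code. It rests on two assumptions: that an XSD pattern facet acts like a whole-string match, and that a dfdl:lengthPattern acts like a prefix match performed by a backtracking engine that takes the first alternative that succeeds (Python's re module is used here as a stand-in for such an engine). The helper name prefix_len is hypothetical, and the data "TAS//" is what remains of the GENTEXT input after the initiator, TextIndicator, and one separator have been consumed.

```python
import re

# Whole-string matching (assumed model of an XSD pattern facet):
# the engine backtracks into the second alternative if the first one
# cannot cover the entire value, so alternative order does not matter.
for pattern in ("GET|GETVALUE", "GETVALUE|GET"):
    assert re.fullmatch(pattern, "GET") is not None
    assert re.fullmatch(pattern, "GETVALUE") is not None

# Prefix matching (assumed model of dfdl:lengthPattern):
# nothing forces the match to reach the end of the data, so the first
# alternative that succeeds wins and the order does matter.
def prefix_len(pattern: str, data: str) -> int:
    """Return the length of the match starting at position 0, or 0 if none."""
    m = re.match(pattern, data)
    return m.end() if m else 0

short_first = r"[A-Z]|([A-Z][/A-Z]*[A-Z])"
long_first = r"([A-Z][/A-Z]*[A-Z])|[A-Z]"
remaining = "TAS//"  # data left for FreeText in GENTEXT/FOO/TAS//

print(prefix_len(short_first, remaining))  # only "T" is consumed
print(prefix_len(long_first, remaining))   # "TAS" is consumed
```

With the short alternative first, only "T" is taken, so "AS//" is left in front of the "//" terminator, which is consistent with the "Left over data" error. With the long alternative first, "TAS" is taken and the terminator follows immediately.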