Actually, all the regex engines work similarly. First off, Daffodil simply calls the Java regex engine. It has no regex engine of its own, and the Java regex engine behaves nearly identically to JS, VB.net, etc. etc., at least at this level of detail.
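Since Daffodil hands the pattern to the Java regex engine, the two situations in this thread can be reproduced with java.util.regex directly: matching an already isolated string in full, and matching a prefix to find a length. Here is a minimal sketch. The class and method names are made up for illustration, and using lookingAt() as a stand-in for what dfdl:lengthPattern does is an assumption, since the thread does not say which Java call Daffodil makes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Daffodil.
final class AlternationDemo {

    // Whole-string match: the entire input must match, as with an XSD pattern facet.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Prefix match: returns the text matched at the start of the input,
    // or null if nothing matches there. Assumed analogue of a length-determining match.
    static String matchPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Whole-string: the engine backtracks into the second alternative.
        System.out.println(matchesWhole("GET|GETVALUE", "GETVALUE")); // true
        // Prefix: the first alternative that succeeds wins.
        System.out.println(matchPrefix("GET|GETVALUE", "GETVALUE"));  // GET
        System.out.println(matchPrefix("GETVALUE|GET", "GETVALUE"));  // GETVALUE
    }
}
```

With the whole-string call the order of the alternatives makes no visible difference; with the prefix call it decides how many characters are consumed.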
This site: https://myregextester.com/index.php lets you test regular expressions against the Java, JS, VB, etc. regex engines (it doesn't have XSD as a choice, though).

When you use a regex to *determine the length* of something, it is quite different from when you use a regex to match all of an *already isolated* string.

In XSD (or JSON Schema), patterns are matched against an already isolated string. Hence, whatever you put in the pattern behaves as if it were surrounded by the regex start-of-data and end-of-data markers.

So try this regex at myregextester.com:

^(GET|GETVALUE)$

If you match that against the string "GETVALUE" it succeeds and the match is "GETVALUE". That's because it tries GET first, and then fails because $ (end of data) is not next. So it goes back to try the other alternative. This is why XSD does not need the regex alternatives ordered.

If you remove the ^ and $ from the regex and match against "GETVALUE" it also succeeds, but the match is "GET", and it stops there. That's what Daffodil is doing. This holds for Java, JS, VB.net, etc.

So a dfdl:lengthPattern simply isn't the same as an XSD pattern facet. They are for radically different purposes.

The dfdl:lengthPattern is for determining which characters are to be included in the element. The eager regex will accept the first match and stop, having determined the length.

The XSD pattern facet is for validating whether the already-isolated string of characters matches, in its *entirety*, the pattern. The end of the data is already known in this case.

The behavior of dfdl:lengthPattern being sequential is actually really important. We don't want a lengthPattern match to have to scan to the end of the entire data stream just to find out that a very long potential match fails, only to backtrack and find a short match quickly.

On Wed, Apr 6, 2022 at 2:25 PM Roger L Costello <coste...@mitre.org> wrote:

> Thanks Mike. That is contrary to the way that regexes work in XSD.
> For example, here I list the regex choice alternatives shortest to longest:
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>     <xs:element name="test">
>         <xs:simpleType>
>             <xs:restriction base="xs:string">
>                 <xs:pattern value="GET|GETVALUE"/>
>             </xs:restriction>
>         </xs:simpleType>
>     </xs:element>
> </xs:schema>
>
> This XML document validates against the XSD:
>
> <test>GETVALUE</test>
>
> So does this one:
>
> <test>GET</test>
>
> Here I changed the XSD to list the regex choice alternatives longest to shortest:
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>     <xs:element name="test">
>         <xs:simpleType>
>             <xs:restriction base="xs:string">
>                 <xs:pattern value="GETVALUE|GET"/>
>             </xs:restriction>
>         </xs:simpleType>
>     </xs:element>
> </xs:schema>
>
> Again, both XML documents validate against the schema.
>
> JSON and JSON Schema behave the same way.
>
> So does Flex and Bison.
>
> I find the behavior of Daffodil to be quite different than any regex engine that I’ve ever used.
>
> /Roger
>
> *From:* Beckerle, Mike <mbecke...@owlcyberdefense.com>
> *Sent:* Wednesday, April 6, 2022 2:05 PM
> *To:* Roger L Costello <coste...@mitre.org>; users@daffodil.apache.org
> *Subject:* [EXT] Re: Bug in Daffodil
>
> On that page, paragraph 4 under the heading "Remember That the Regex Engine is Eager"
>
> "I already explained that the regex engine is eager <https://www.regular-expressions.info/engine.html>. It stops searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters. "
>
> It then goes on to explain how "Get|GetValue" the order matters, and that the 2nd alternative is only tried if the first one fails.
>
> ------------------------------
>
> *From:* Roger L Costello <coste...@mitre.org>
> *Sent:* Wednesday, April 6, 2022 1:54 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>; Beckerle, Mike <mbecke...@owlcyberdefense.com>
> *Subject:* Re: Bug in Daffodil
>
> Hi Mike,
>
> I read the web page you referenced. I don’t see where it says that the order of regex choice alternatives matter. Would you quote the sentence that says that, please?
>
> /Roger
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Wednesday, April 6, 2022 1:48 PM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Bug in Daffodil
>
> This is standard regex behavior. Order of the regex choice alternatives matters very much. Authors of regex must organize for longest matches to be attempted first.
>
> See: https://www.regular-expressions.info/alternation.html
>
> This is one of the reasons DFDL delimiters don't let you just write a regex.
>
> For delimiter matching, DFDL insists on the longest match, where standard regex behavior is simply "always eager" behavior.
>
> On Wed, Apr 6, 2022 at 1:34 PM Roger L Costello <coste...@mitre.org> wrote:
>
> With this input:
>
> GENTEXT/FOO/TAS//
>
> The following DFDL generates the dreaded “Left over data” error:
>
> <xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT" dfdl:terminator="//">
>     <xs:complexType>
>         <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
>             <xs:element name="TextIndicator" minOccurs="0" nillable="true"
>                         type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
>             <xs:element name="FreeText" minOccurs="0" nillable="true"
>                         type="non-zero-length-string"
>                         dfdl:lengthPattern="[A-Z]|([A-Z][/A-Z]*[A-Z])"/>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
> If I reverse the regex for FreeText:
>
> <xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT" dfdl:terminator="//">
>     <xs:complexType>
>         <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
>             <xs:element name="TextIndicator" minOccurs="0" nillable="true"
>                         type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
>             <xs:element name="FreeText" minOccurs="0" nillable="true"
>                         type="non-zero-length-string"
>                         dfdl:lengthPattern="([A-Z][/A-Z]*[A-Z])|[A-Z]"/>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
> Then the error goes away.
>
> This seems like a bug in Daffodil. The order in which a regex OR clause is expressed should not matter.
>
> /Roger
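
The two FreeText patterns from the original report can be checked the same way in plain Java. This is only a sketch under stated assumptions: the helper class is hypothetical, lookingAt() is assumed to approximate how a dfdl:lengthPattern is applied at the current position, and "TAS//" is assumed to be the data still unconsumed when FreeText is parsed (after the GENTEXT initiator, the TextIndicator "FOO", and the prefix separators).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not Daffodil code.
final class LengthPatternDemo {

    // Number of characters matched at the start of data, or -1 if no match there.
    static int matchedLength(String regex, String data) {
        Matcher m = Pattern.compile(regex).matcher(data);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String remaining = "TAS//"; // assumed remaining data for FreeText

        // Short alternative first: the single-letter branch succeeds immediately.
        System.out.println(matchedLength("[A-Z]|([A-Z][/A-Z]*[A-Z])", remaining)); // 1

        // Long alternative first: the engine backtracks off the trailing slashes.
        System.out.println(matchedLength("([A-Z][/A-Z]*[A-Z])|[A-Z]", remaining)); // 3
    }
}
```

Under these assumptions the first ordering consumes only "T", which leaves "AS//" where the "//" terminator is expected and is consistent with the "Left over data" error. The reversed ordering consumes "TAS" and leaves exactly the terminator.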