Actually, all the regex engines work similarly.

First off, Daffodil simply calls the Java regex engine. It has no regex
engine of its own, and the Java regex engine behaves nearly identically to
those in JS, VB.net, etc., at least at this level of detail.

This site: https://myregextester.com/index.php lets you test regular
expressions against the Java, JS, VB, etc. regex engines (it doesn't offer
XSD as a choice, though).

When you use a regex to *determine the length* of something, it is quite
different from when you use a regex to match all of an *already isolated*
string.

In XSD (or JSON Schema), patterns are matched against an already isolated
string. Hence, whatever you put in the pattern behaves as if it were
surrounded by the regex start-of-data (^) and end-of-data ($) markers.

So try this regex at myregextester.com ^(GET|GETVALUE)$

If you match that against the string "GETVALUE" it succeeds and the match
is "GETVALUE". That's because it tries GET first, and then fails because $
(end of data) is not next. So it goes back to try the other alternative.
This is why the order of the regex alternatives does not matter in XSD.

If you remove the ^ and $ from the regex and match against "GETVALUE" it
also succeeds, but the match is "GET", and it stops there. That's what
Daffodil is doing.

This holds for Java, JS, VB.net, etc.
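The experiment above can also be reproduced directly in Java. This is only a
minimal sketch: the class name AlternationDemo and the helper matchAtStart are
made up for illustration, and using Matcher.lookingAt() (match at the start of
the input, without requiring the match to reach the end) is my stand-in for
the unanchored case, not a statement about Daffodil's internals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // Hypothetical helper: returns the text matched at the start of the
    // input, or null if the regex does not match there.
    static String matchAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Anchored: GET is tried first, $ fails, so GETVALUE is tried next.
        System.out.println(matchAtStart("^(GET|GETVALUE)$", "GETVALUE")); // GETVALUE

        // Unanchored: the first alternative that matches wins.
        System.out.println(matchAtStart("GET|GETVALUE", "GETVALUE")); // GET

        // Unanchored, longest alternative listed first.
        System.out.println(matchAtStart("GETVALUE|GET", "GETVALUE")); // GETVALUE

        // Whole-string matching, comparable to an XSD pattern facet:
        // the order of the alternatives makes no difference.
        System.out.println(Pattern.matches("GET|GETVALUE", "GETVALUE")); // true
        System.out.println(Pattern.matches("GETVALUE|GET", "GETVALUE")); // true
    }
}
```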

So a dfdl:lengthPattern simply isn't the same as an XSD pattern facet. They
are for radically different purposes.

The dfdl:lengthPattern is for determining which characters are to be
included in the element.  The eager regex will accept the first match and
stop, having determined the length.

The XSD pattern facet is for validating whether the already-isolated string
of characters matches, in its *entirety*, the pattern. The end of the data is
already known in this case.

The behavior of dfdl:lengthPattern being sequential is actually really
important. We don't want a lengthPattern match to have to scan to the end
of the entire data stream just to find out that a very long potential match
fails, only to backtrack and find a short match quickly.



On Wed, Apr 6, 2022 at 2:25 PM Roger L Costello <coste...@mitre.org> wrote:

> Thanks Mike. That is contrary to the way that regexes work in XSD.  For
> example, here I list the regex choice alternatives shortest to longest:
>
>
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>     <xs:element name="test">
>         <xs:simpleType>
>             <xs:restriction  base="xs:string">
>                 <xs:pattern value="GET|GETVALUE"/>
>             </xs:restriction>
>         </xs:simpleType>
>     </xs:element>
> </xs:schema>
>
>
>
> This XML document validates against the XSD:
>
>
>
> <test>GETVALUE</test>
>
>
>
> So does this one:
>
>
>
> <test>GET</test>
>
>
>
> Here I changed the XSD to list the regex choice alternatives longest to
> shortest:
>
>
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>     <xs:element name="test">
>         <xs:simpleType>
>             <xs:restriction  base="xs:string">
>                 <xs:pattern value="GETVALUE|GET"/>
>             </xs:restriction>
>         </xs:simpleType>
>     </xs:element>
> </xs:schema>
>
>
>
> Again, both XML documents validate against the schema.
>
>
>
> JSON and JSON Schema behave the same way.
>
>
>
> So does Flex and Bison.
>
>
>
> I find the behavior of Daffodil to be quite different than any regex
> engine that I’ve ever used.
>
>
>
> /Roger
>
>
>
> *From:* Beckerle, Mike <mbecke...@owlcyberdefense.com>
> *Sent:* Wednesday, April 6, 2022 2:05 PM
> *To:* Roger L Costello <coste...@mitre.org>; users@daffodil.apache.org
> *Subject:* [EXT] Re: Bug in Daffodil
>
>
>
> On that page, paragraph 4 under the heading "Remember That the Regex
> Engine is Eager"
>
>
>
> "I already explained that the regex engine is eager
> <https://www.regular-expressions.info/engine.html>. It stops searching as
> soon as it finds a valid match. The consequence is that in certain
> situations, the order of the alternatives matters. "
>
>
>
> It then goes on to explain how, for "Get|GetValue", the order matters, and
> that the 2nd alternative is only tried if the first one fails.
>
>
> ------------------------------
>
> *From:* Roger L Costello <coste...@mitre.org>
> *Sent:* Wednesday, April 6, 2022 1:54 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>; Beckerle,
> Mike <mbecke...@owlcyberdefense.com>
> *Subject:* Re: Bug in Daffodil
>
>
>
> Hi Mike,
>
>
>
> I read the web page you referenced. I don’t see where it says that the
> order of regex choice alternatives matters. Would you quote the sentence
> that says that, please?
>
>
>
> /Roger
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Wednesday, April 6, 2022 1:48 PM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Bug in Daffodil
>
>
>
> This is standard regex behavior. Order of the regex choice alternatives
> matters very much. Authors of regex must organize for longest matches to be
> attempted first.
>
>
>
> See: https://www.regular-expressions.info/alternation.html
>
>
>
> This is one of the reasons DFDL delimiters don't let you just write a
> regex.
>
>
>
> For delimiter matching, DFDL insists on the longest match, where standard
> regex behavior is simply "always eager" behavior.
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Apr 6, 2022 at 1:34 PM Roger L Costello <coste...@mitre.org>
> wrote:
>
> With this input:
>
>
>
> GENTEXT/FOO/TAS//
>
>
>
> The following DFDL generates the dreaded “Left over data” error:
>
>
>
> <xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT"
> dfdl:terminator="//">
>     <xs:complexType>
>         <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
>             <xs:element name="TextIndicator" minOccurs="0" nillable="true"
> type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
>             <xs:element name="FreeText" minOccurs="0" nillable="true"
> type="non-zero-length-string"
> dfdl:lengthPattern="[A-Z]|([A-Z][/A-Z]*[A-Z])"/>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
>
>
> If I reverse the regex for FreeText:
>
>
>
> <xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT"
> dfdl:terminator="//">
>     <xs:complexType>
>         <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
>             <xs:element name="TextIndicator" minOccurs="0" nillable="true"
> type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
>             <xs:element name="FreeText" minOccurs="0" nillable="true"
> type="non-zero-length-string"
> dfdl:lengthPattern="([A-Z][/A-Z]*[A-Z])|[A-Z]"/>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
>
>
> Then the error goes away.
>
>
>
> This seems like a bug in Daffodil. The order in which a regex OR clause is
> expressed should not matter.
>
>
>
> /Roger
>
>
