Thanks Mike. That is contrary to the way that regexes work in XSD.  For 
example, here I list the regex choice alternatives shortest to longest:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="test">
        <xs:simpleType>
            <xs:restriction  base="xs:string">
                <xs:pattern value="GET|GETVALUE"/>
            </xs:restriction>
        </xs:simpleType>
    </xs:element>
</xs:schema>

This XML document validates against the XSD:

<test>GETVALUE</test>

So does this one:

<test>GET</test>

Here I changed the XSD to list the regex choice alternatives longest to 
shortest:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="test">
        <xs:simpleType>
            <xs:restriction  base="xs:string">
                <xs:pattern value="GETVALUE|GET"/>
            </xs:restriction>
        </xs:simpleType>
    </xs:element>
</xs:schema>

Again, both XML documents validate against the schema.
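
A likely reason the order makes no difference here is that an XSD pattern must match the entire value. If "GET" matches but does not reach the end of the string, the engine goes on to try "GETVALUE". Below is a minimal sketch of that distinction. It uses Python's re module as a stand-in (it is not the XSD regex dialect), and the variable names are only illustrative:

```python
import re

value = "GETVALUE"
results = {}
for pattern in ("GET|GETVALUE", "GETVALUE|GET"):
    # fullmatch must consume the whole value, as an XSD pattern facet does
    whole = re.fullmatch(pattern, value).group()
    # match only has to succeed at the start, so the first alternative
    # that succeeds decides how much is consumed
    prefix = re.match(pattern, value).group()
    results[pattern] = (whole, prefix)
    print(f"{pattern}: whole-value = {whole!r}, prefix = {prefix!r}")
```

With whole-value matching both orderings accept GETVALUE. With prefix matching only the longest-first ordering consumes all of it.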

JSON and JSON Schema behave the same way.

So do Flex and Bison.

I find the behavior of Daffodil to be quite different from that of any regex 
engine that I've ever used.

/Roger

From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
Sent: Wednesday, April 6, 2022 2:05 PM
To: Roger L Costello <coste...@mitre.org>; users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil

On that page, paragraph 4 under the heading "Remember That the Regex Engine is 
Eager"

"I already explained that the regex engine is 
eager<https://www.regular-expressions.info/engine.html>. It stops searching as 
soon as it finds a valid match. The consequence is that in certain situations, 
the order of the alternatives matters. "

It then goes on to explain that with "Get|GetValue" the order matters, and that 
the 2nd alternative is only tried if the first one fails.
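
As a rough illustration of that eager behavior, here is a small sketch using Python's re module, which is a backtracking engine of the same general kind. The variable names are made up for the example, and the word-boundary variant is just one common way to force the longer alternative:

```python
import re

text = "GetValue"

# Leftmost-first alternation: "Get" succeeds, so "GetValue" is never tried.
short_first = re.match(r"Get|GetValue", text).group()

# Listing the longer alternative first lets it win.
long_first = re.match(r"GetValue|Get", text).group()

# Requiring a word boundary after the alternation makes "Get" fail after
# three characters, so the engine backtracks into the second alternative.
bounded = re.match(r"\b(?:Get|GetValue)\b", text).group()

print(short_first, long_first, bounded)
```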

________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Wednesday, April 6, 2022 1:54 PM
To: users@daffodil.apache.org; Beckerle, Mike <mbecke...@owlcyberdefense.com>
Subject: Re: Bug in Daffodil


Hi Mike,



I read the web page you referenced. I don't see where it says that the order of 
regex choice alternatives matters. Would you quote the sentence that says that, 
please?



/Roger



From: Mike Beckerle <mbecke...@apache.org>
Sent: Wednesday, April 6, 2022 1:48 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil



This is standard regex behavior. The order of the regex choice alternatives 
matters very much. Regex authors must order the alternatives so that the 
longest match is attempted first.



See: https://www.regular-expressions.info/alternation.html



This is one of the reasons DFDL delimiters don't let you just write a regex.



For delimiter matching, DFDL insists on the longest match, where standard regex 
behavior is simply "always eager" behavior.
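
To make the contrast concrete, here is a hypothetical sketch of longest-match selection among literal delimiters. It is not Daffodil's implementation; the function name and the candidate list are assumptions made for illustration:

```python
def longest_delimiter(candidates, data, pos=0):
    """Return the longest candidate found at data[pos:], or None if none match."""
    hits = [c for c in candidates if data.startswith(c, pos)]
    return max(hits, key=len, default=None)


# The order of the candidates does not affect the result.
print(longest_delimiter(["/", "//"], "TAS//", 3))
print(longest_delimiter(["//", "/"], "TAS//", 3))
```

Both calls pick "//", because the selection is by length rather than by the order in which the candidates are listed.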

On Wed, Apr 6, 2022 at 1:34 PM Roger L Costello <coste...@mitre.org> wrote:

With this input:



GENTEXT/FOO/TAS//



The following DFDL generates the dreaded "Left over data" error:



<xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT" 
dfdl:terminator="//">
    <xs:complexType>
        <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
            <xs:element name="TextIndicator" minOccurs="0" nillable="true" 
type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
            <xs:element name="FreeText" minOccurs="0" nillable="true" 
type="non-zero-length-string" dfdl:lengthPattern="[A-Z]|([A-Z][/A-Z]*[A-Z])"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>



If I reverse the regex for FreeText:



<xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT" 
dfdl:terminator="//">
    <xs:complexType>
        <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
            <xs:element name="TextIndicator" minOccurs="0" nillable="true" 
type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
            <xs:element name="FreeText" minOccurs="0" nillable="true" 
type="non-zero-length-string" dfdl:lengthPattern="([A-Z][/A-Z]*[A-Z])|[A-Z]"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>



Then the error goes away.
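
A minimal sketch of what may be happening. It assumes that dfdl:lengthPattern is applied as a match anchored at the current position rather than against a whole value, and that "TAS//" is what remains once "GENTEXT/FOO/" has been consumed. Python's re module stands in for the regex engine, and the variable names are illustrative only:

```python
import re

remaining = "TAS//"  # assumed remainder of the input at the FreeText element

short_first = r"[A-Z]|([A-Z][/A-Z]*[A-Z])"
long_first = r"([A-Z][/A-Z]*[A-Z])|[A-Z]"

# re.match anchors at the start only, so the first alternative that
# succeeds determines the length of the element.
short_len = re.match(short_first, remaining).end()
long_len = re.match(long_first, remaining).end()

print("short-first consumes", remaining[:short_len], "leaving", remaining[short_len:])
print("long-first consumes", remaining[:long_len], "leaving", remaining[long_len:])
```

Under those assumptions the first schema takes only "T" for FreeText and leaves "AS//", which does not begin with the "//" terminator. The reversed pattern takes "TAS" and leaves exactly "//".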



This seems like a bug in Daffodil. The order in which a regex OR clause is 
expressed should not matter.



/Roger
