Re: issue with using testPattern in an assertion

Steve Lawrence Mon, 21 Apr 2025 07:53:18 -0700

Correct, regular expressions like those used in discriminator/assert testKind="pattern" or in dfdl:lengthKind="pattern" ignore delimiters. Note that regexes do obey the bit limit, so, for example, if a complex parent has dfdl:lengthKind="explicit" then the regex will not scan past the length of that parent.

And note that it's not uncommon to ignore delimiters; it's not just a regex thing. For example, say we have this schema snippet:


  <sequence dfdl:separator="%NL;">
    <element name="parent" dfdl:lengthKind="delimited" ...>
      <complexType>
        <sequence>
          <element name="child" dfdl:lengthKind="explicit" ... />
        </sequence>
      </complexType>
    </element>
  </sequence>

In this example, the child's explicit length will ignore any delimiters used by the parent. The NL separator will only be scanned for *after* the child is parsed. And if an NL exists within the explicit length of the child then it will just become part of the child's content.
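
To make the ordering concrete, here is a rough Python sketch of the idea. It is only a simplified model, not how Daffodil is implemented; the function name parse_parent and its parameters are made up for illustration.

```python
def parse_parent(data, child_length, separator="\n"):
    """Toy model: take the explicit-length child first, then look for the separator."""
    if len(data) < child_length:
        raise ValueError("not enough data for the explicit-length child")
    # The child consumes exactly child_length characters, newlines included.
    child = data[:child_length]
    # Only after the child is consumed do we scan for the parent's separator.
    sep_index = data.find(separator, child_length)
    return child, sep_index

# The first newline falls inside the child's 5 characters, so it is just content.
child, sep_index = parse_parent("ab\ncd\nrest", 5)
print(repr(child), sep_index)
```

The newline at offset 2 ends up inside the child, and the separator is found at offset 5, right after the child ends.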


On 2025-04-21 10:32 AM, Adams, Joshua wrote:

Thanks for the explanation Steve, I had been looking into why things behave this way as well and was a little confused about how the patterns in a dfdl:assert were handled.

It sounds like it is intentional that patterns can go beyond the bounds of local delimiters then, correct? I.e., in this example we have a sequence of strings that are separated by newlines, but the pattern for the first element of the sequence will read beyond the first newline, correct?

I agree that using a pattern restriction is generally preferable, but wanted to make sure that this behavior of dfdl:assert patterns reaching beyond local delimiters was intentional.


Josh
--------------------------------------------------------------------------------
*From:* Steve Lawrence <slawre...@apache.org>
*Sent:* Monday, April 21, 2025 8:29 AM
*To:* users@daffodil.apache.org <users@daffodil.apache.org>
*Subject:* Re: issue with using testPattern in an assertion
I think the issue is that with testKind="pattern", the dot wildcard character in
a regex matches newlines--it behaves as if "(?s)" is prepended to your regex.

So essentially each time your regex is run it will scan the entire data stream
and will always be a successful match as long as the data has an open
parenthesis somewhere and ends with a close parenthesis, regardless of what line
everything happens on.
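
As a rough illustration, the behavior can be reproduced outside of Daffodil with a short Python sketch. It assumes the pattern is matched starting at the position where each line begins and is free to run to the end of the data; re.DOTALL stands in for "(?s)", and the helper name assert_results is made up for this example.

```python
import re

# Sample data from the original report, one record per line.
DATA = "aaa(111)\nbbb\n(222)\nccc(333)XXX\n()\n(444)"

# re.DOTALL lets "." match newlines, mimicking the "(?s)" behavior.
PATTERN = re.compile(r".+\(.+\)", re.DOTALL)

def assert_results(data):
    """Return (line, matched) pairs, matching from each line's start offset."""
    results = []
    pos = 0
    for line in data.split("\n"):
        # match() is anchored at pos but may run on to the end of the data.
        results.append((line, PATTERN.match(data, pos) is not None))
        pos += len(line) + 1  # skip the line and its newline separator
    return results

for line, ok in assert_results(DATA):
    print(repr(line), "passes" if ok else "fails")
```

Under those assumptions only the last line fails, and appending "aaa(111)" as a seventh line makes every line pass, which lines up with what was reported.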

One way to fix this is to not use the dot wildcard and use a character class to
ensure your dots only match the characters you expect. There's a number of ways
to do this, but if you want to match everything except newlines, you could do
something like this:

    testPattern="[^\r\n]+\([^\r\n]+\)[\r\n]"

So that matches one or more non-newline characters, followed by one or more
non-newline characters wrapped in parentheses, followed by a newline character.
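
A quick way to sanity check that pattern is a Python sketch like the one below. It assumes the data ends with a trailing newline (the pattern requires a newline after every line, including the last one) and that matching starts at the beginning of each line; the helper name line_results is made up for this example.

```python
import re

# Same sample data, assumed here to end with a trailing newline.
DATA = "aaa(111)\nbbb\n(222)\nccc(333)XXX\n()\n(444)\n"

# The character classes cannot cross a newline, so each match stays on one line.
LINE_PATTERN = re.compile(r"[^\r\n]+\([^\r\n]+\)[\r\n]")

def line_results(data):
    """Return (line, matched) pairs, matching from each line's start offset."""
    results = []
    pos = 0
    for line in data.splitlines(keepends=True):
        matched = LINE_PATTERN.match(data, pos) is not None
        results.append((line.rstrip("\r\n"), matched))
        pos += len(line)  # line still includes its newline here
    return results

for line, ok in line_results(DATA):
    print(repr(line), "passes" if ok else "fails")
```

With this pattern only the first line passes and the other five fail.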

An alternative approach, which has a number of benefits and is what I would
recommend for this kind of thing, is to use an XSD pattern restriction instead
of a recoverableError assertion, e.g.:

    <element name="line" ... >
      <simpleType>
        <restriction base="xs:string">
          <pattern value=".+\(.+\)" />
        </restriction>
      </simpleType>
    </element>

A pattern restriction looks only at the infoset content rather than the
underlying data stream, so you don't have to worry about newlines anymore and
you can use the original regular expression.
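
To sketch the difference, the check below runs the original regex against each parsed string value on its own, the way a pattern restriction sees them. Python's re module is not an XSD regex engine, so this is only an approximation: re.fullmatch is used because XSD patterns must match the whole value, and the name valid_values is made up for this example.

```python
import re

# The string values as they would appear in the infoset, one per "line" element.
VALUES = ["aaa(111)", "bbb", "(222)", "ccc(333)XXX", "()", "(444)"]

# The original regex, unchanged. No newline handling is needed because each
# value is checked in isolation.
RESTRICTION = re.compile(r".+\(.+\)")

def valid_values(values):
    """Return the values that satisfy the pattern over their entire length."""
    return [v for v in values if RESTRICTION.fullmatch(v)]

print(valid_values(VALUES))
```

Only "aaa(111)" satisfies the pattern, so the other five values would be reported as validation failures.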

This is also nice because it's normal XSD, so other tools can be used to
validate the values of the infoset, instead of relying only on Daffodil's
testPattern. For example, if you add the "--validate on" option in the Daffodil
CLI, it will use Xerces to validate the infoset, which outputs a more verbose
validation message, such as the string that failed the pattern restriction.

This is also nice in that if you don't care about validation you can just not
enable the validation option. This can be useful for testing. But there is no
way to disable a testPattern assertion.


On 2025-04-17 02:55 PM, Mark Kozak wrote:

Hello folks.

I am reaching out for a sanity check please.
I am seeing a regular expression behavior that was driving me mad, but may actually be a bug?
The example below is a simplified version for illustration:
The goal is to check that a line of text starts with a string and ends with another string in parentheses.
Using the following data and subsequent schema, only the first line should pass
validation. So I expect to see 5 validation failures. However, only the last line fails.
Then just to keep things interesting, copy the first line to the end of the file, and then there are no validation failures at all.
It appears that the assertion is being checked against only the last element in
the sequence. Is that the intended behavior?

I have tried this with 3.6 and 3.9 and get the same results both times.

aaa(111)
bbb
(222)
ccc(333)XXX
()
(444)

    <element name="sample">
      <complexType>
        <sequence dfdl:separator="%NL;">
          <element name="line" dfdl:lengthKind="delimited" type="xs:string"
                   dfdl:occursCountKind="implicit" maxOccurs="unbounded">
            <annotation>
              <appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:assert testKind="pattern" failureType="recoverableError"
                             testPattern=".+\(.+\)" />
              </appinfo>
            </annotation>
          </element>
        </sequence>
      </complexType>
    </element>

Thank you for the help.

Mark Kozak

Director of Engineering

Adeptus Cyber Solutions

Adeptus-CS.com
