Correct, regular expressions like those used in discriminator/assert
testKind="pattern" or in dfdl:lengthKind="pattern" ignore delimiters. Note that
regexes do obey the bit limit, so for example if a complex parent has
dfdl:lengthKind="explicit" then the regex will not scan past the length of that
parent.
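As a rough illustration of the bit-limit point, here is a minimal Python sketch. It is not Daffodil code: Python's re module stands in for Daffodil's regex engine, the data and pattern are made up, and the endpos argument plays the role of the limit imposed by the parent's explicit length.

```python
import re

# Hypothetical data and pattern, chosen only for illustration.
DATA = "abc\ndef)ghi)"
PATTERN = re.compile(r".+\)", re.DOTALL)  # DOTALL: the dot also matches newlines


def match_within_limit(data, start, limit):
    """Return the text matched from `start`, never reading at or past `limit`."""
    m = PATTERN.match(data, start, limit)
    return m.group(0) if m else None


# No enclosing limit: the greedy match runs to the last ")" in the stream.
print(repr(match_within_limit(DATA, 0, len(DATA))))
# Parent limited to 8 characters: the match stops at the first ")".
print(repr(match_within_limit(DATA, 0, 8)))
# Parent limited to 6 characters: no ")" is visible, so there is no match.
print(repr(match_within_limit(DATA, 0, 6)))
```

The regex still ignores the newline inside the data, but it cannot see anything beyond the limit it is given.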
And note that ignoring delimiters is not uncommon; it's not just a regex
thing. For example, say we have this schema snippet:
<sequence dfdl:separator="%NL;">
  <element name="parent" dfdl:lengthKind="delimited" ...>
    <complexType>
      <sequence>
        <element name="child" dfdl:lengthKind="explicit" ... />
      </sequence>
    </complexType>
  </element>
</sequence>
In this example, the child's explicit length will ignore any delimiters used by
the parent. The NL separator will only be scanned for *after* the child is
parsed. And if an NL exists within the explicit length of the child then it will
just become part of the child's content.
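The behavior described above can be mimicked with a toy Python model. This is only a sketch under simplifying assumptions, not how Daffodil is implemented: parse_parent is a hypothetical helper, lengths are counted in characters, and the separator is a single "\n".

```python
def parse_parent(data, child_length, separator="\n"):
    """Toy model: read a fixed-length child first, then look for the separator."""
    if len(data) < child_length:
        raise ValueError("not enough data for the child")
    # The explicit length is taken as-is; delimiters inside it are not scanned for.
    child = data[:child_length]
    rest = data[child_length:]
    # Only after the child is parsed is the parent's separator looked for.
    if not rest.startswith(separator):
        raise ValueError("separator not found after child")
    return child, rest[len(separator):]


child, remaining = parse_parent("ab\ncd\nnext", 5)
print(repr(child))      # the newline inside the first 5 characters is content
print(repr(remaining))  # data following the separator
```

Here the first newline falls inside the child's five characters and becomes part of its value, while the second newline is the one consumed as the separator.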
On 2025-04-21 10:32 AM, Adams, Joshua wrote:
Thanks for the explanation Steve, I had been looking into why things behave this
way as well and was a little confused how the patterns in a dfdl:assert were
handled.
It sounds like it is intentional that patterns can go beyond the bounds of local
delimiters then, correct? I.e., in this example we have a sequence of strings
that are separated by newlines, but the pattern for the first element of the
sequence will read beyond the first newline, correct?
I agree that using a pattern restriction is generally preferable, but wanted to
make sure that this behavior of dfdl:assert patterns reaching beyond local
delimiters was intentional.
Josh
--------------------------------------------------------------------------------
*From:* Steve Lawrence <slawre...@apache.org>
*Sent:* Monday, April 21, 2025 8:29 AM
*To:* users@daffodil.apache.org <users@daffodil.apache.org>
*Subject:* Re: issue with using testPattern in an assertion
I think the issue is that with testKind="pattern", the dot wildcard character in
a regex matches newlines--it behaves as if "(?s)" is prepended to your regex.
So essentially each time your regex is run it will scan the entire data stream
and will always be a successful match as long as the data has an open
parenthesis somewhere and ends with a close parenthesis, regardless of what line
everything happens on.
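To make that concrete, here is a minimal Python sketch that reproduces the reported results. It makes several assumptions: Python's re module with re.DOTALL stands in for Daffodil's regex engine, the assertion is modeled as a match attempted at the start of each line, and the sample data is assumed to end with a newline.

```python
import re

# Sample data from the original report (trailing newline is an assumption).
DATA = "aaa(111)\nbbb\n(222)\nccc(333)XXX\n()\n(444)\n"
# re.DOTALL mimics the "(?s)" behavior: the dot also matches newlines.
PATTERN = re.compile(r".+\(.+\)", re.DOTALL)


def assert_results(data):
    """Try the pattern against the data stream at the start of each line."""
    results = []
    pos = 0
    for line in data.splitlines(keepends=True):
        results.append(PATTERN.match(data, pos) is not None)
        pos += len(line)
    return results


print(assert_results(DATA))
# Copying the first line to the end gives every earlier line a later "(...)".
print(assert_results(DATA + "aaa(111)\n"))
```

In this model only the last line fails with the original data, and nothing fails once the first line is copied to the end, which lines up with what was observed.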
One way to fix this is to not use the dot wildcard and use a character class to
ensure your dots only match the characters you expect. There are a number of
ways to do this, but if you want to match everything except newlines, you could
do something like this:
testPattern="[^\r\n]+\([^\r\n]+\)[\r\n]"
So that matches one or more non-newline characters, followed by one or more
non-newline characters wrapped in parentheses, followed by a newline character.
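A quick way to sanity check the character-class pattern is the Python sketch below. As before, this is an approximation: Python's re module stands in for Daffodil's regex engine, the match is attempted at the start of each line, and the sample data is assumed to end with a newline.

```python
import re

# Sample data from the original report (trailing newline is an assumption).
DATA = "aaa(111)\nbbb\n(222)\nccc(333)XXX\n()\n(444)\n"
# Character classes cannot cross a line ending, even with DOTALL enabled.
PATTERN = re.compile(r"[^\r\n]+\([^\r\n]+\)[\r\n]", re.DOTALL)


def line_results(data):
    """Try the pattern against the data stream at the start of each line."""
    results = []
    pos = 0
    for line in data.splitlines(keepends=True):
        results.append(PATTERN.match(data, pos) is not None)
        pos += len(line)
    return results


print(line_results(DATA))
```

Under these assumptions only the first line matches, giving the five failures that were originally expected.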
An alternative approach, which has a number of benefits and is what I would
recommend for this kind of thing, is to use an XSD pattern restriction instead
of a recoverableError assertion, e.g.:
<element name="line" ... >
  <simpleType>
    <restriction base="xs:string">
      <pattern value=".+\(.+\)" />
    </restriction>
  </simpleType>
</element>
A pattern restriction looks only at the infoset content rather than the
underlying data stream, so you don't have to worry about newlines anymore and
you can use the original regular expression.
This is also nice because it's normal XSD, so other tools can be used to
validate the values of the infoset, instead of relying only on Daffodil's
testPattern. For example, if you add the "--validate on" option in the Daffodil
CLI, it will use Xerces to validate the infoset, which outputs more verbose
validation messages, such as the string that failed the pattern restriction.
This is also nice in that if you don't care about validation you can just not
enable the validation option. This can be useful for testing. But there is no
way to disable a testPattern assertion.
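The pattern-restriction approach can be approximated in Python as follows. This is a sketch, not what a validator actually runs: XSD patterns must match the whole value, so re.fullmatch is used, Python's regex syntax is assumed to be close enough to XSD's for this particular pattern, and the values are the six lines of the sample data as they would appear in the infoset.

```python
import re

# Infoset values: each line of the sample data, without the separators.
VALUES = ["aaa(111)", "bbb", "(222)", "ccc(333)XXX", "()", "(444)"]
# The original regular expression; the dot never sees a newline here because
# each value is a single line.
PATTERN = re.compile(r".+\(.+\)")


def validate(values):
    """Return (value, is_valid) pairs, matching each whole value like XSD does."""
    return [(v, PATTERN.fullmatch(v) is not None) for v in values]


for value, ok in validate(VALUES):
    print(f"{value!r}: {'valid' if ok else 'INVALID'}")
```

Because the whole value has to match, "ccc(333)XXX" is rejected too, so only the first line validates.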
On 2025-04-17 02:55 PM, Mark Kozak wrote:
Hello folks.
I am reaching out for a sanity check please.
I am seeing a regular expression behavior that was driving me mad, but may
actually be a bug?
The example below is a simplified version for illustration:
The goal is to check that a line of text starts with a string and ends with
another string in parentheses.
Using the following data and subsequent schema, only the first line should pass
validation. So I expect to see 5 validation failures. However, only the last
line fails.
Then just to keep things interesting, copy the first line to the end of the
file, and then there are no validation failures at all.
It appears that the assertion is being checked against only the last element in
the sequence. Is that the intended behavior?
I have tried this with 3.6 and 3.9 and get the same results both times.
aaa(111)
bbb
(222)
ccc(333)XXX
()
(444)
<element name="sample">
  <complexType>
    <sequence dfdl:separator="%NL;" >
      <element name="line" dfdl:lengthKind="delimited" type="xs:string"
               dfdl:occursCountKind="implicit" maxOccurs="unbounded" >
        <annotation>
          <appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:assert testKind="pattern"
                         failureType="recoverableError"
                         testPattern=".+\(.+\)" />
          </appinfo>
        </annotation>
      </element>
    </sequence>
  </complexType>
</element>
Thank you for the help.
Mark Kozak
Director of Engineering
Adeptus Cyber Solutions
Adeptus-CS.com