I thought restriction pattern facet alternatives don't need to be
sorted? This is because the pattern must match the entire value of the
element--there's essentially an implied ^ and $ surrounding the pattern.
So, for example, if the pattern facet were "(ab|abc.*)" and the infoset were
<foo>abcxyz</foo>
the first alternative would match the start of the value, but not the
entire value. The second alternative would then be tried and would
match, so this would validate. The order in which the alternatives are
tried doesn't really matter, since the match is all or nothing.
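As a rough illustration of the all-or-nothing behavior, here is a minimal sketch that uses Python's re module as a stand-in for an XSD regex engine. That substitution is an assumption: the two regex dialects differ, though they should agree for patterns this simple. The helper name validates is hypothetical.

```python
import re

def validates(pattern: str, value: str) -> bool:
    """Return True if the pattern matches the ENTIRE value.

    re.fullmatch behaves as if the pattern were wrapped in ^ and $,
    which mimics how a pattern facet is applied (assumption: Python's
    regex dialect stands in for XSD's).
    """
    return re.fullmatch(pattern, value) is not None

# "ab" matches only a prefix, so the engine falls back to "abc.*".
print(validates("(ab|abc.*)", "abcxyz"))  # True
# Reversing the alternatives gives the same answer.
print(validates("(abc.*|ab)", "abcxyz"))  # True
```

Because the whole value has to be consumed, the order of the alternatives changes only which one is tried first, not whether the value validates.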
The reason they need to be sorted with Daffodil's dfdl:lengthPattern is
that Daffodil doesn't know where the end of the data is. We use the
pattern for scanning, so once the pattern comes back with a match (which
could be from the first alternative) we stop scanning. We don't keep
scanning through all the regex alternatives to find the longest, for example.
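The scanning behavior can be sketched the same way. This is not Daffodil's code, only a minimal model of "take the first match at the current position and stop", again assuming Python's re module as a stand-in regex engine. The helper name scan_length is hypothetical.

```python
import re

def scan_length(pattern: str, data: str):
    """Return how many characters the pattern consumes from the start
    of data, or None if it does not match there.

    re.match anchors only at the start, so the first alternative that
    succeeds wins, even if a later one would have consumed more.
    """
    m = re.match(pattern, data)
    return m.end() if m else None

data = "abcd"
print(scan_length("abc|abcd", data))  # 3, so "d" is left over
print(scan_length("abcd|abc", data))  # 4, all of the data is consumed
```

With the shorter alternative first, the scan stops after "abc" and the trailing "d" is what ends up reported as left over data.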
On 8/5/22 5:50 PM, Mike Beckerle wrote:
Yes you do. All the regex engines I know are greedy.
Besides regexes just being fussy, this is the main reason DFDL has a delimiter
language that is its own thing: the delimiters are specified in
different places, not all together as in a regex. Hence the user has no
opportunity to sort longest to shortest, so DFDL matches against all the
possible delimiters that can appear at a point, with the longest match preferred.
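A minimal sketch of longest-match-preferred delimiter selection, for illustration only. It is not how Daffodil implements delimiter scanning, and the function name longest_delimiter_at and the sample delimiters are hypothetical.

```python
def longest_delimiter_at(data: str, pos: int, delimiters):
    """Return the longest delimiter that matches data at pos, or None.

    Every candidate is checked, so the order in which the delimiters
    are supplied has no effect on the result.
    """
    candidates = [d for d in delimiters if data.startswith(d, pos)]
    return max(candidates, key=len, default=None)

# Both ";" and ";;" match at position 1; the longer one is preferred.
print(longest_delimiter_at("a;;b", 1, [";", ";;"]))  # ;;
print(longest_delimiter_at("a;;b", 1, [";;", ";"]))  # ;;
```

Unlike a regex alternation, nothing here depends on the user listing the delimiters longest to shortest.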
On Fri, Aug 5, 2022, 1:54 PM Roger L Costello <coste...@mitre.org
<mailto:coste...@mitre.org>> wrote:
Hi Folks,
Recall that when using dfdl:lengthPattern you must specify its regex
alternatives longest-to-shortest. For example, if you specify this:
dfdl:lengthPattern="abc|abcd"
then you will get a "left over data" error message.
So you must sort the alternatives in longest-to-shortest order. That is a
hassle.
The "-V limited" option changes things. It enables me to abandon
dfdl:lengthPattern and instead use the XSD pattern facet:
<simpleType>
  <restriction base="string">
    <pattern value="abc|abcd"/>
  </restriction>
</simpleType>
Question: Do I need to sort the pattern facet alternatives in
longest-to-shortest order? I am hoping the answer is "no".
/Roger