Hi Mike,

You wrote:

  * ((ab|abb)(cd|cdd)) will not match abbcdd

Not true. I tested that and it works fine, both for parsing and unparsing.

  * But ((abb|ab)(cdd|cd)) will match it.

I tested that as well and it also works fine, both for parsing and unparsing.

Conclusion: There is no need to sort alternatives in an XSD pattern facet. This is terrific news! This makes things so much easier – just use the XSD pattern facet as is.

/Roger

From: Mike Beckerle <mbecke...@apache.org>
Sent: Tuesday, August 9, 2022 10:30 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Do I need to sort the xs:pattern regex alternatives longest-to-shortest?

I *think* in a pattern facet, if you nest choices of alternatives you can create the situation where it will not backup to try an alternative on an inner choice because that decision is made greedily.

E.g.,

((ab|abb)(cd|cdd)) will not match abbcdd
But ((abb|ab)(cdd|cd)) will match it.

Not all regex engines are greedy. The one in XSD is (supposedly), as is the one in Java and DFDL/Daffodil. However the POSIX regex semantics requires longest matches. So the above first regex would match on POSIX regex.

This is one of the big problems with regex. Despite POSIX standardization, variants have proliferated endlessly. So we are stuck with the minimum functionality, which is greedy matches.

On Tue, Aug 9, 2022 at 7:40 AM Steve Lawrence <slawre...@apache.org> wrote:

I thought restriction pattern facet alternatives don't need to be sorted? This is because the pattern must match the entire value of the element--there's essentially an implied ^ and $ surrounding the pattern.

So for example, if the pattern facet was "(ab|abc.*)" and an infoset of

<foo>abcxyz</foo>

The first alternative would match the start of the value, but not the entire value. And so the second alternative would be tried and would match, so this would validate. The order it tries the match doesn't really matter since it's all or nothing.

The reason why they need to be sorted with Daffodil's lengthPattern is because Daffodil doesn't know where the end of the data is. We use the pattern for scanning. So once a pattern comes back with a match (which could be the first alternative) we stop scanning. We don't continue scanning trying all regex alternatives to find the longest, for example.

On 8/5/22 5:50 PM, Mike Beckerle wrote:
> Yes you do. All the regex engines I know are greedy.
>
> Besides regexes just being fussy, this is the main reason DFDL has a delimiter
> language that is its own thing. Because the delimiters are specified in
> different places, not all together as in a regex. Hence the user has no
> opportunity to sort longest to shortest, so DFDL delimiters match all the
> possible delimiters that can appear at a point with longest match preferred.
>
> On Fri, Aug 5, 2022, 1:54 PM Roger L Costello <coste...@mitre.org> wrote:
>
> Hi Folks,
>
> Recall that when using dfdl:lengthPattern you must specify its regex
> alternatives longest-to-shortest. For example, if you specify this:
>
> dfdl:lengthPattern="abc|abcd"
>
> then you will get a "left over data" error message.
>
> So you must sort the alternatives in longest-to-shortest order. That is a
> hassle.
>
> The "-V limited" option changes things. It enables me to abandon
> dfdl:lengthPattern and instead use the XSD pattern facet:
>
> <simpleType>
>   <restriction base="string">
>     <pattern value="abc|abcd"/>
>   </restriction>
> </simpleType>
>
> Question: Do I need to sort the pattern facet alternatives in
> longest-to-shortest order? I am hoping the answer is "no".
>
> /Roger
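
The difference discussed in this thread, between matching the whole value (pattern facet) and scanning for a prefix (dfdl:lengthPattern), can be reproduced outside Daffodil. The following is only a minimal sketch, assuming Python's re module as a stand-in backtracking engine; Daffodil itself runs on Java's regex engine, and the XSD regex dialect differs in details that these simple patterns do not touch. The helper names validates and scan_prefix are hypothetical and are not part of any Daffodil or XSD API.

```python
import re


def validates(pattern: str, value: str) -> bool:
    """Whole-value match, approximating an XSD pattern facet.

    fullmatch behaves as if the pattern were wrapped in ^ and $, so the
    engine backtracks into later alternatives until the entire value is
    consumed or every combination has failed.
    """
    return re.fullmatch(pattern, value) is not None


def scan_prefix(pattern: str, data: str) -> tuple[str, str]:
    """Prefix match, approximating how a length pattern is used for scanning.

    The first alternative that matches at the start wins; nothing forces
    the engine to look for a longer one. Returns (consumed, left_over).
    """
    m = re.match(pattern, data)
    if m is None:
        return "", data
    return m.group(0), data[m.end():]


# Whole-value matching: the order of the alternatives does not matter.
assert validates(r"((ab|abb)(cd|cdd))", "abbcdd")
assert validates(r"((abb|ab)(cdd|cd))", "abbcdd")
assert validates(r"(ab|abc.*)", "abcxyz")
assert validates(r"abc|abcd", "abcd")

# Prefix scanning: the order of the alternatives decides how much is consumed.
assert scan_prefix(r"abc|abcd", "abcd") == ("abc", "d")   # shortest first leaves data over
assert scan_prefix(r"abcd|abc", "abcd") == ("abcd", "")   # longest first consumes it all
```

In this sketch the shortest-first length pattern stops after "abc" and leaves "d" unconsumed, which corresponds to the "left over data" situation described in the original question, while the same alternation used for whole-value validation accepts "abcd" in either order.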