Re: \{n\} are not recognized as repetition counter in regular expressions.
On Thu, May 22, 2025 at 12:30:37PM -0700, Paul Eggert wrote:
> On 2025-05-22 10:42, Eric Blake wrote:
> > BSD has tried hard to make their m4 be a drop-in replacement
> > enough that autoconf will use it instead of mandating GNU m4
>
> Is this a real problem, though? I just now tried running Autoconf's
> 'configure' on FreeBSD 15, and FreeBSD 15's /usr/bin/m4 quickly failed
> Autoconf's 'configure' test because it didn't support GNU m4's -F option.
That's good to remember. I remember that when I checked years ago, the
BSD version of m4 was even further away from GNU m4 (probably when I
was more active in Autoconf, around 2008). Looking at the FreeBSD git
changelog [1], I can see various spurts of development over time where
the various BSD families have cross-pollinated their patches, such as
adding some long option support (but not --help!) and 'm4 -G' in 2023.
And this despite their man page stating things like:
  format(formatstring, arg1, ...)
      Returns formatstring with escape sequences substituted with arg1
      and following arguments, in a way similar to printf(3). This
      built-in is only available in GNU-m4 compatibility mode, and the
      only parameters implemented are there for autoconf compatibility:
      left-padding flag, an optional field width, a maximum field
      width, *-specified field widths, and the %s and %c data type.
At any rate, you are right that without frozen files they are not yet
a drop-in replacement for autoconf's needs (unless they also hack
their local builds of autoconf to not use frozen files - at which
point anyone motivated enough to improve two pieces of open source
software to work with each other is entitled to do so).
[1] https://cgit.freebsd.org/src/log/usr.bin/m4
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
Re: \{n\} are not recognized as repetition counter in regular expressions.
On 2025-05-22 10:42, Eric Blake wrote:
> BSD has tried hard to make their m4 be a drop-in replacement
> enough that autoconf will use it instead of mandating GNU m4
Is this a real problem, though? I just now tried running Autoconf's
'configure' on FreeBSD 15, and FreeBSD 15's /usr/bin/m4 quickly failed
Autoconf's 'configure' test because it didn't support GNU m4's -F option.
Re: \{n\} are not recognized as repetition counter in regular expressions.
On Mon, Apr 07, 2025 at 01:37:10PM -0500, Eric Blake wrote:
> On Mon, Apr 07, 2025 at 12:46:11PM -0500, Eric Blake wrote:
> > And relying on a configure-time test of what m4 supports on the
> > packager's machine is not necessarily going to work when autoconf is
> > run on a developer's machine with a different version of m4. Which in
> > turn implies that it is desirable to be able to probe at runtime what
> > is supported, rather than being limited to a command-line switch. But
> > while it is easy to write a runtime probe on whether "regexp([{],
> > [\{1\}]) results in -1 (old m4, ergo \{ is a literal) or 0 (m4 that
> > has enabled repetition operator semantics),
>
> Correcting myself: the above would return -1 regardless of whether \{
> is literal or repetition (since a repetition is no good without an
> earlier sequence to repeat). If you want to probe a -1 or 0 value as a
> witness of regex flavor, then the haystack and needle have to be
> something a bit smarter. As in: regexp([{{], [.\{1\}]) which returns
> -1 when \{ is literal and 0 when it is a repetition.
Adding some more weirdness to be aware of:
I've been playing with the FreeBSD 14 build of m4, which has 'm4 -g'
that tries to be GNU-alike (it's not perfect, but close). But among
other things:
# m4 -g
regexp({,\{)
0
regexp({,{)
m4: stdin at line 2: regular expression error in {: repetition-operator operand
invalid.
#
Urgh - BSD has tried hard to make their m4 be a drop-in replacement
enough that autoconf will use it instead of mandating GNU m4, but not
only is the hard exit (rather than a warning) undesirable, the fact
that they picked { rather than \{ for repetition in the emacs-like
syntax is backwards. (It's easy enough to patch autoconf's configure
to weed out broken m4 like this, at which point it will be an arms
race until BSD changes their m4 to comply again...)
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
Re: \{n\} are not recognized as repetition counter in regular expressions.
On Mon, Apr 07, 2025 at 10:51:26AM -0400, Zack Weinberg wrote:
> On Fri, Apr 4, 2025, at 3:56 PM, Eric Blake wrote:
> …
> > tl;dr: if I do add intervals to m4 regex, would you rather it be \{\}
> > (BRE and emacs style) or {} (ERE style)? And how to avoid breaking
> > existing m4 scripts?
>
> (Note: replying to chunks of your message out of original order.)
>
> With my autoconf hat on: The *safest* thing to do, I think, would be
> to leave the existing regex syntax strictly alone until a mechanism
> for specifying the regex syntax on a per-regex basis is available.
>
> If any changes are made to the existing syntax, I would strongly
> prefer that it be harmonized with POSIX BREs (i.e. ( + ? | { are
> all literals, \( \+ \? \| \{ are all operators) rather than with Emacs…
>
> > For some contrast, in BRE (POSIX basic regular expression), all of (,
> > +, ?, |, and { are literals, while \( is grouping, \{ is intervals, and
> > \+, \?, and \| are up to the implementation on whether they are
> > literal or meta. In ERE (POSIX extended regular expression), all of
> > (, +, ?, |, and { are operators, while \(, \+, \?, \|, and \{ are
> > literals (as seen in 'grep' vs. 'grep -E'). The all-or-none factor is
> > a convenience to remember; anything else feels like a disservice to
> > users, if the practice is not already long-standing.
>
> …because of this. As a user of BREs, EREs, *and* emacs regexps for
> going on thirty years now, emacs regexps are the worst of the three,
> because they don’t have the all-or-none characteristic. I wind up
> avoiding using + and ? at all in emacs regexps—neither as literals nor
> as operators—because I find it too difficult to remember which way
> around they work.
That resonates with me. It is always a pain to figure out "was my
failure to match anything because I typed the regex right and nothing
matched, or because I typod it wrong by adding or forgetting \ to the
point that the regex looked for something completely different from my
desires"; and then have to write another regex or two until finding
something that DOES match to remind myself of which spelling works,
before rewriting the original intended regex.
>
> Also, in Autoconf source code, where m4 is being used to generate a
> shell script, m4 regexps may well appear right next to regexps
> intended for use by grep, sed, awk, etc. Therefore it is desirable
> for m4 regexps to match the regexp syntax used by those tools, which
> is usually either POSIX BRE or ERE.
Interesting point, and one where diverging from emacs may make the
most sense. Too bad we already have \( but + in m4 regex (that mix
matches emacs, but like you say is harder to remember).
>
> > One of the ideas on the m4 2.0 branch (which is nowhere near usable)
> > was to let users control the regex syntax at runtime, rather than
> > hard-coding to a single syntax, and then expanding the manual to
> > document the different syntax choices possible. In addition to
> > choosing which characters are meta, there are per-match tweaks that
> > can be useful, such as the ability to choose whether newline or NUL
> > match ".", or whether the match is case-insensitive.
>
> Autoconf’s M4sugar layer anticipates this change; that’s why its
> prefixed names for the `regexp` and `patsubst` builtins are
> `m4_bregexp` and `m4_bpatsubst`. Long term I think it is desirable,
> but not as much as some of the other changes that have been stacked
> up for over a decade on various M4 development branches, like the
> linear-time $@ recursion work.
>
> My initial reaction to your message was that backward compatibility
> would require the hypothetical M4 2.0 to give us _new_ builtins for
> EREs, perhaps `eregexp` and `epatsubst`. While writing _this_ message
> it occurred to me that one could instead use flags embedded at the
> beginning of the regex itself, like Perl does with (?i) for case
> insensitive, (?x) for “expanded” notation (unescaped whitespace is not
> significant), etc. (Stealing a subset of the Perl (?…) extensions
> would be worthwhile _anyway_.) But the trouble with that idea is,
> right now “regexp(haystack, `(?i)NEEDLE’)” does a literal match.
> We can’t break that either! So maybe we _should_ have `eregexp`
> and `epatsubst` as the first stepping stone away from the old syntax.
> In ERE with no extensions ‘(?’ is a syntax error, so it’s a safe
> extension point.
The current tentative plan in the m4 2.0 branch
was to add a new fourth argument, as in:
patsubst(haystack, needle, replacement, syntax)
where syntax (if present) has to be a recognized literal (such as
"emacs", "bre", "ere"). A single builtin like that could power both
m4_bpatsubst (accept three parameters but call the underlying builtin
with the fourth slammed to "bre") and m4_epatsubst (accept three
parameters but call the underlying with the fourth slammed to "ere").
But your idea of a leading sequence IN the needle parameter (rather
than a fourth parameter) is clever. I
Re: \{n\} are not recognized as repetition counter in regular expressions.
On Mon, Apr 07, 2025 at 12:46:11PM -0500, Eric Blake wrote:
> And relying on a configure-time test of what m4 supports on the
> packager's machine is not necessarily going to work when autoconf is
> run on a developer's machine with a different version of m4. Which in
> turn implies that it is desirable to be able to probe at runtime what
> is supported, rather than being limited to a command-line switch. But
> while it is easy to write a runtime probe on whether "regexp([{],
> [\{1\}]) results in -1 (old m4, ergo \{ is a literal) or 0 (m4 that
> has enabled repetition operator semantics),
Correcting myself: the above would return -1 regardless of whether \{
is literal or repetition (since a repetition is no good without an
earlier sequence to repeat). If you want to probe a -1 or 0 value as a
witness of regex flavor, then the haystack and needle have to be
something a bit smarter. As in: regexp([{{], [.\{1\}]) which returns
-1 when \{ is literal and 0 when it is a repetition.
> it doesn't work if doing
> the probe itself triggers a warning when you have only opted in to the
> portability diagnosis rather than the new semantics.
This part is still true, if there is no runtime way to turn warnings
on or off independently of changing syntax.
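The corrected probe can be sketched directly against the GNU regex API,
outside of m4. This is only an illustration under assumptions: it
assumes glibc's <regex.h> (the same GNU interface that gnulib supplies
to m4), and probe_intervals is a made-up helper name, not anything in
m4 or autoconf. It compiles the needle .\{1\} and searches the haystack
{{ under a caller-chosen set of syntax bits.

```c
#define _GNU_SOURCE             /* glibc: expose the GNU regex API */
#include <regex.h>
#include <string.h>

/* Run the probe from the thread, regexp([{{], [.\{1\}]), under the
   given GNU regex syntax bits.  Returns the match offset, -1 for no
   match, or -2 on a compile or internal matcher error.
   probe_intervals is a hypothetical name used only for this sketch.  */
int
probe_intervals (reg_syntax_t syntax)
{
  static const char needle[] = ".\\{1\\}";
  static const char haystack[] = "{{";
  struct re_pattern_buffer buf;
  int pos = -2;

  /* Zeroed buffer: no preallocated storage, fastmap, or translate table.  */
  memset (&buf, 0, sizeof buf);
  /* re_set_syntax changes process-global state used by later compiles.  */
  re_set_syntax (syntax);
  if (re_compile_pattern (needle, strlen (needle), &buf) == NULL)
    pos = (int) re_search (&buf, haystack, (regoff_t) strlen (haystack),
                           0, (regoff_t) strlen (haystack), NULL);
  regfree (&buf);
  return pos;
}
```

Calling probe_intervals (0) corresponds to today's m4, where \{ is a
literal; calling probe_intervals (RE_INTERVALS) corresponds to a build
with repetition enabled. The sketch says nothing about warnings, which
is the part that a runtime probe inside m4 cannot sidestep.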
> >
> > > lib/autoconf/general.m4:[m4_if(m4_bregexp([$1],
> > > [#\|\\\|`\|\(\$\|@S|@\)\((|{|@{:@\)]), [-1],
> >
> > > lib/m4sugar/m4sugar.m4:
> > > [@\(\(<:\|:>\|S|\|%:\|\{:\|:\}\)\(@\)\|&t@\)],
> >
> > be *indifferent* to whether { or \{ is a literal or an operator.
If you WANT a regex that is indifferent to { or \{ being a literal and
the other a metacharacter, you can always use "[{]" which is
guaranteed to be a literal regardless of the flavor of { and \{
outside [].
If you WANT a regex that expresses repetition, your only solutions are
"newer m4 where that regex is supported" or "spell out the repetitions
yourself": "aaaaa" instead of "a\{5\}", or "\([ab][ab][ab]?\)" instead
of "[ab]\{2,3\}". And since m4 is Turing complete, you could even
pre-process any regex with \{digit\} into a corresponding regex that
IS portable to older versions, although it may require adding yet
another layer of \(\) grouping and thus rewriting the substitution of
any \DIGIT back-refs, and the amount of work to process your regex
into something usable would make regex even slower.
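Both claims above (a bracketed brace is a literal in every flavor, and
a spelled-out pattern accepts the same text as its interval form) can
be checked against the GNU regex API. This is a rough sketch assuming
glibc's <regex.h>; match_length is a helper name invented for the
example. It returns how many characters a pattern matches at the start
of a string under a given set of syntax bits.

```c
#define _GNU_SOURCE             /* glibc: expose the GNU regex API */
#include <regex.h>
#include <string.h>

/* Length of the match of PATTERN anchored at the start of TEXT under
   the GNU regex SYNTAX bits, -1 if it does not match there, or -2 on
   a compile or internal matcher error.  match_length is a
   hypothetical helper for this sketch.  */
int
match_length (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  int len = -2;

  memset (&buf, 0, sizeof buf);   /* start from an empty pattern buffer */
  re_set_syntax (syntax);         /* process-global syntax selection */
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    len = (int) re_match (&buf, text, (regoff_t) strlen (text), 0, NULL);
  regfree (&buf);
  return len;
}
```

For instance, match_length (0, "[{]", "{") and
match_length (RE_INTERVALS | RE_NO_BK_BRACES, "[{]", "{") should agree,
and match_length (0, "\\([ab][ab][ab]?\\)", "abab") should agree with
match_length (RE_INTERVALS, "[ab]\\{2,3\\}", "abab"). Note the
spelled-out form adds a \(\) group, which is exactly what shifts any
\DIGIT back-references in a real rewrite.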
Still, this makes me wonder - is a single warning good enough ("your
code has \{, but this changes semantics depending on m4 version"), or
do we want two orthogonal warnings? One that only occurs on \{ that
is blatantly not a repetition (i.e. when the next character is neither
a digit nor comma, "this will fail to compile in the future; use bare
{ instead to continue to match literal"), and the other that only
occurs on something that looks like a repetition (regardless of
whether something else orthogonal can switch the flavor between enabled
and disabled, the warning of "this regex is not portable to all versions
of m4, based on whether repetitions are enabled")?
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
Re: \{n\} are not recognized as repetition counter in regular expressions.
On Fri, Apr 4, 2025, at 3:56 PM, Eric Blake wrote:
…
> tl;dr: if I do add intervals to m4 regex, would you rather it be \{\}
> (BRE and emacs style) or {} (ERE style)? And how to avoid breaking
> existing m4 scripts?
(Note: replying to chunks of your message out of original order.)
With my autoconf hat on: The *safest* thing to do, I think, would be
to leave the existing regex syntax strictly alone until a mechanism
for specifying the regex syntax on a per-regex basis is available.
If any changes are made to the existing syntax, I would strongly
prefer that it be harmonized with POSIX BREs (i.e. ( + ? | { are
all literals, \( \+ \? \| \{ are all operators) rather than with Emacs…
> For some contrast, in BRE (POSIX basic regular expression), all of (,
> +, ?, |, and { are literals, while \( is grouping, \{ is intervals, and
> \+, \?, and \| are up to the implementation on whether they are
> literal or meta. In ERE (POSIX extended regular expression), all of
> (, +, ?, |, and { are operators, while \(, \+, \?, \|, and \{ are
> literals (as seen in 'grep' vs. 'grep -E'). The all-or-none factor is
> a convenience to remember; anything else feels like a disservice to
> users, if the practice is not already long-standing.
…because of this. As a user of BREs, EREs, *and* emacs regexps for
going on thirty years now, emacs regexps are the worst of the three,
because they don’t have the all-or-none characteristic. I wind up
avoiding using + and ? at all in emacs regexps—neither as literals nor
as operators—because I find it too difficult to remember which way
around they work.
Also, in Autoconf source code, where m4 is being used to generate a
shell script, m4 regexps may well appear right next to regexps
intended for use by grep, sed, awk, etc. Therefore it is desirable
for m4 regexps to match the regexp syntax used by those tools, which
is usually either POSIX BRE or ERE.
> One of the ideas on the m4 2.0 branch (which is nowhere near usable)
> was to let users control the regex syntax at runtime, rather than
> hard-coding to a single syntax, and then expanding the manual to
> document the different syntax choices possible. In addition to
> choosing which characters are meta, there are per-match tweaks that
> can be useful, such as the ability to choose whether newline or NUL
> match ".", or whether the match is case-insensitive.
Autoconf’s M4sugar layer anticipates this change; that’s why its
prefixed names for the `regexp` and `patsubst` builtins are
`m4_bregexp` and `m4_bpatsubst`. Long term I think it is desirable,
but not as much as some of the other changes that have been stacked
up for over a decade on various M4 development branches, like the
linear-time $@ recursion work.
My initial reaction to your message was that backward compatibility
would require the hypothetical M4 2.0 to give us _new_ builtins for
EREs, perhaps `eregexp` and `epatsubst`. While writing _this_ message
it occurred to me that one could instead use flags embedded at the
beginning of the regex itself, like Perl does with (?i) for case
insensitive, (?x) for “expanded” notation (unescaped whitespace is not
significant), etc. (Stealing a subset of the Perl (?…) extensions
would be worthwhile _anyway_.) But the trouble with that idea is,
right now “regexp(haystack, `(?i)NEEDLE’)” does a literal match.
We can’t break that either! So maybe we _should_ have `eregexp`
and `epatsubst` as the first stepping stone away from the old syntax.
In ERE with no extensions ‘(?’ is a syntax error, so it’s a safe
extension point.
(I would not be sad if I got PCRE regexps in M4 2.0 ;-)
>> > BTW, the fixed m4 passes 247 tests, skips 20 tests, and fails no tests
>> > on my system. Original m4 shows exactly the same results.
>
> That only means the testsuite does not check that both spellings, {
> and \{, keep their current behavior of being literals. It doesn't tell you how
> many other scripts might break. So I at least tried to find possible
> problems.
Thanks for looking. It’s a risky change even with all the issues in
Autotools proper addressed, because there is _so much_ badly written
third-party autoconf scripting out there. Once we have a coherent
set of patches for m4+autotools I think we will need to ask for a
test archive rebuild on one of the big Linux distributions, and
based on how that goes we might need to scrap the idea.
If you send a fully worked out patch for M4 1.4.x to the
autoconf-patches mailing list I will undertake to run the Autoconf and
Automake test suites with that patch applied (and something done about
the issues you already found). Libtool will also need testing but I
do not remember whether it has a very good testsuite itself.
> At BEST, all I could do for m4 1.4.20 is to add a new tri-state
> command-line switch; default is existing behavior (both spellings are
> silently literals), a warning mode (add a warning if the \{ spelling
> is encountered, but still treat it as a literal), or enabled (\{ works
> for
Re: \{n\} are not recognized as repetition counter in regular expressions.
Autoconf developers: see below for a bug report on _AC_DEFINE_UNQUOTED
Gnulib developers, maybe you have an opinion on why regex.h
documentation disagrees with reality?
tl;dr: if I do add intervals to m4 regex, would you rather it be \{\}
(BRE and emacs style) or {} (ERE style)? And how to avoid breaking
existing m4 scripts?
On Fri, Apr 04, 2025 at 08:47:20AM -0500, Eric Blake wrote:
> On Fri, Nov 04, 2022 at 04:25:45AM +0300, Van de Bugger wrote:
> > M4 documentation for regular expressions is extremely short:
> > https://www.gnu.org/software/m4/manual/html_node/Regexp.html
> > No regular expression syntax is explained, it just refers to GNU Emacs
> > Manual. In turn, GNU Emacs Manual:
> > https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html
> > states that \{n\} is repetition counter:
> >
> > > For example, ‘x\{4\}’ matches the string ‘xxxx’ and nothing else.
> >
> > However, m4 recognizes neither \{n\} nor {n}.
>
> This has bothered me too, over the years.
>
> Even though it would be a new feature to enable \{\}, I can't see how
> portable GNU m4 programs would have been relying on that matching
> literal left curly brace. I'm seriously thinking about turning this
> feature on for 1.4.20.
I'm waffling. Reading regex.h (from gnulib) clearly says that the
'emacs' mode of re_set_syntax (when re_syntax_options == 0) recognizes
bare + and ? as operators, that RE_INTERVALS==0 means both { and \{
are literals, and that RE_NO_BK_BRACES controls whether { or \{ is
magic but only when RE_INTERVALS is set.
But clearly, emacs supports \{ intervals, even though a grep of
emacs.git does not find any hits for RE_INTERVALS outside of
src/regex.[ch]. So either the comments in regex.h about 0 being emacs
syntax are wrong, or I'm totally missing how emacs supports intervals.
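For reference, the regex.h reading above can be checked empirically.
The following is a sketch only, assuming glibc's <regex.h> behaves like
the gnulib copy that m4 links against; matches_xx and interval_spelling
are names made up for the illustration. It reports which brace spelling,
if any, acts as the interval operator for a given set of syntax bits.

```c
#define _GNU_SOURCE             /* glibc: expose the GNU regex API */
#include <regex.h>
#include <string.h>

/* Return 1 if PATTERN, compiled under SYNTAX, matches exactly the two
   characters of "xx" from offset 0, else 0 (including on errors).
   Hypothetical helper for this sketch.  */
int
matches_xx (reg_syntax_t syntax, const char *pattern)
{
  struct re_pattern_buffer buf;
  int len = -2;

  memset (&buf, 0, sizeof buf);   /* start from an empty pattern buffer */
  re_set_syntax (syntax);         /* process-global syntax selection */
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    len = (int) re_match (&buf, "xx", 2, 0, NULL);
  regfree (&buf);
  return len == 2;
}

/* Report which spelling is the interval operator under SYNTAX:
   '\\' for \{ \}, '{' for bare { }, or 0 when both are literals.
   Hypothetical helper for this sketch.  */
int
interval_spelling (reg_syntax_t syntax)
{
  if (matches_xx (syntax, "x\\{2\\}"))
    return '\\';
  if (matches_xx (syntax, "x{2}"))
    return '{';
  return 0;
}
```

Trying interval_spelling with 0, RE_INTERVALS, RE_NO_BK_BRACES alone,
and RE_INTERVALS | RE_NO_BK_BRACES covers the four combinations that
the header comments describe. It does not answer how emacs itself
enables intervals, only what the library does for each bit pattern.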
For some contrast, in BRE (POSIX basic regular expression), all of (,
+, ?, |, and { are literals, while \( is grouping, \{ is intervals, and
\+, \?, and \| are up to the implementation on whether they are
literal or meta. In ERE (POSIX extended regular expression), all of
(, +, ?, |, and { are operators, while \(, \+, \?, \|, and \{ are
literals (as seen in 'grep' vs. 'grep -E'). The all-or-none factor is
a convenience to remember; anything else feels like a disservice to
users, if the practice is not already long-standing.
emacs syntax is closer to BRE (in that grouping uses "\(" rather than
"("), but not quite (in that "+" rather than "\+" is meta). And since
POSIX does not mandate regex in m4 at all, having GNU M4 be more like
emacs (rather than more like BRE or ERE) is a good goal. The fact
that emacs uses "\{" for intervals works in our favor - since we
already require \( and \|, \{ feels more like BRE (although we do have
+ rather than \+).
One of the ideas on the m4 2.0 branch (which is nowhere near usable)
was to let users control the regex syntax at runtime, rather than
hard-coding to a single syntax, and then expanding the manual to
document the different syntax choices possible. In addition to
choosing which characters are meta, there are per-match tweaks that
can be useful, such as the ability to choose whether newline or NUL
match ".", or whether the match is case-insensitive.
>
> > I was able to build m4 from the sources. I found regcomp.c file with
> > the regular expression compiler. The regular expression syntax is
> > controlled by re_syntax_options file scope variable, which can be set
> > by re_set_syntax function. In my experiments, re_set_syntax is not
> > called, re_syntax_options is always zero, so braces are treated
> > literally.
> >
> > If I initialize re_syntax_options to RE_INTERVALS, e. g.:
> >
> > reg_syntax_t re_syntax_options = RE_INTERVALS;
Yes, setting that variable (or calling re_set_syntax()) would be how
to do it.
> >
> > m4 recognizes \{n\} as repetition counter.
> >
> > Thus, m4 is able to recognize \{n\} as repetition counter, but (for
> > a reason unknown to me) this feature is disabled. I failed to trace it
> > further.
> >
> > m4 manual does not document if this feature can be enabled or disabled
> > at build or run time, so I assume it should be enabled, as the \{n\}
> > construct is documented in GNU Emacs Manual, referred by GNU m4 Manual.
> >
> > BTW, the fixed m4 passes 247 tests, skips 20 tests, and fails no tests
> > on my system. Original m4 shows exactly the same results.
That only means the testsuite does not check that both spellings, {
and \{, keep their current behavior of being literals. It doesn't tell you how
many other scripts might break. So I at least tried to find possible
problems.
A quick grep of autoconf source finds at least:
lib/autoconf/general.m4:[m4_if(m4_bregexp([$1],
[#\|\\\|`\|\(\$\|@S|@\)\((|{|@{:@\)]), [-1],
in the definition of _AC_DEFINE_UNQUOTED. That regex is asking for a
literal "#", "\", "`", or the concatenation of [either "$" or the
quadrigraph "@S|@" (an alias for "$")] with [literal "(|{|"
concatenated with the quadrigra
