Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-07-10 Thread Paul Eggert

On 2025-07-10 03:31, James Youngman wrote:

Do we need to notify the Emacs maintainers about what is happening (or
did happen)?


No, as they inspired the Gnulib change in the first place. See 
.





Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-07-10 Thread James Youngman
Do we need to notify the Emacs maintainers about what is happening (or
did happen)?



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-07-09 Thread Paul Eggert

On 2025-07-09 17:38, Collin Funk wrote:

I assume it will have to wait until the next release.


Yes, at least until then.



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-07-09 Thread Bernhard Voelker

On 7/10/25 02:38, Collin Funk wrote:

Bernhard Voelker  writes:


2. Commit (to be pushed) to gnulib - see attachment.

Good to push?


That patch looks good, thanks. I confirmed that I receive the same
output when running './regexprops "Regular Expressions" generic'.


Thanks for the review, pushed.

Have a nice day,
Berny



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-07-09 Thread Collin Funk
Bernhard Voelker  writes:

> 2. Commit (to be pushed) to gnulib - see attachment.
>
> Good to push?

That patch looks good, thanks. I confirmed that I receive the same
output when running './regexprops "Regular Expressions" generic'.

I'll send you a patch later to fix 'make syntax-check' after the Gnulib
update in Findutils.

Paul, just to confirm, the goal is to change RE_SYNTAX_EMACS in glibc? I
think that is what you meant here [1]:

>>   - Will they change the value of RE_SYNTAX_EMACS? Or can't they do this,
>> because that would break backward compatibility?
>
> The idea is to change the value of RE_SYNTAX_EMACS, yes.

I assume it will have to wait until the next release. But it feels a bit
strange to have RE_SYNTAX_EMACS not match what current Emacs does, and
not match the updated documentation.

Thanks,
Collin

[1] https://lists.gnu.org/archive/html/bug-gnulib/2025-04/msg00090.html



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-07-09 Thread Bernhard Voelker

[+bug-findutils]

Hi Collin,

On 7/9/25 04:17, Collin Funk wrote:

Bernhard, I built findutils with GNULIB_SRCDIR set to my local
clone. This uses the latest Gnulib commit instead of the one specified
by the submodule.

This patch causes the following 'make check' failure in findutils:

 ./../doc/regexprops.texi /tmp/check-regexprops.wUz52k differ: char 1649, line 45
 ./../doc/regexprops.texi is out of date.
 Updated output is saved in regexprops.texi.new
 FAIL check-regexprops (exit status: 1)


thanks for reporting this ... I incidentally saw this last weekend as well,
but couldn't get around to fixing it yet.

> But with this patch RE_SYNTAX_EMACS is changed. A diff of the generated
> documentation confirms this.
>
> What is the proper way to fix this? My thinking is to first update the
> findutils uses and copy the regexprops.texi.new to regexprops.texi,
> since the new value of RE_SYNTAX_EMACS is more correct based on this
> thread. This file will also have to be copied to Gnulib's
> doc/regexprops-generic.texi, if I understand correctly.

Correct.

1. Commits already pushed to findutils:

* [PATCH 1/3] maint: update gnulib to latest
  https://cgit.git.sv.gnu.org/cgit/findutils.git/commit/?id=c7f5ff1ed88

* [PATCH 2/3] regexprops: sort regex_map alphabetically
  https://cgit.git.sv.gnu.org/cgit/findutils.git/commit/?id=c9c2c511759

* [PATCH 3/3] doc: regenerate regexprops.texi
  https://cgit.git.sv.gnu.org/cgit/findutils.git/commit/?id=facc27e1804

2. Commit (to be pushed) to gnulib - see attachment.

Good to push?

Have a nice day,
Berny

From f3aaeaf5e2d1cbbbd8c90c4389e7204aa079fdcb Mon Sep 17 00:00:00 2001
From: Bernhard Voelker 
Date: Wed, 9 Jul 2025 21:06:12 +0200
Subject: [PATCH] regexprops-generic: update from regex.h

* doc/regexprops-generic.texi: Re-generate by running the 'regexprops'
binary from GNU findutils:
  ./regexprops "Regular Expressions" generic
At least the recent(ish) change (efd5c380ff) to regex.h aligning
gnulib with Emacs behavior had made this document out-of-date.
Reported by Collin Funk in
.
Additionally, today's findutils commit c9c2c51175 fixed the sort order
of the Texinfo nodes.
---
 ChangeLog   |  13 ++
 doc/regexprops-generic.texi | 228 
 2 files changed, 141 insertions(+), 100 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 7913e25423..f8d0053181 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,16 @@
+2025-07-09  Bernhard Voelker  
+
+	regexprops-generic: update from regex.h
+	* doc/regexprops-generic.texi: Re-generate by running the 'regexprops'
+	binary from GNU findutils:
+	  ./regexprops "Regular Expressions" generic
+	At least the recent(ish) change (efd5c380ff) to regex.h aligning
+	gnulib with Emacs behavior had made this document out-of-date.
+	Reported by Collin Funk in
+	.
+	Additionally, today's findutils commit c9c2c51175 fixed the sort order
+	of the Texinfo nodes.
+
 2025-07-08  Paul Eggert  
 
 	float-h: work around GCC bug 120993
diff --git a/doc/regexprops-generic.texi b/doc/regexprops-generic.texi
index 6de54abda3..9da39526e1 100644
--- a/doc/regexprops-generic.texi
+++ b/doc/regexprops-generic.texi
@@ -1,18 +1,18 @@
-@c Copyright (C) 1994, 1996, 1998, 2000--2001, 2003--2007, 2009--2025 Free
-@c Software Foundation, Inc.
+@c Copyright (C) 1994--2025 Free Software Foundation, Inc.
 @c
 @c Permission is granted to copy, distribute and/or modify this document
 @c under the terms of the GNU Free Documentation License, Version 1.3 or
 @c any later version published by the Free Software Foundation; with no
-@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
-@c copy of the license is at .
+@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the ``GNU Free
+@c Documentation License'' file as part of this distribution.
 
 @c this regular expression description is for: generic
 
 @menu
 * awk regular expression syntax::
-* egrep regular expression syntax::
 * ed regular expression syntax::
+* egrep regular expression syntax::
 * emacs regular expression syntax::
 * gnu-awk regular expression syntax::
 * grep regular expression syntax::
@@ -46,21 +46,24 @@ matches a @samp{?}.
 
 Bracket expressions are used to match ranges of characters.  Bracket expressions where the range is backward, for example @samp{[z-a]}, are invalid.  Within square brackets, @samp{\} can be used to quote the following character.  Character classes are supported; for example @samp{[[:digit:]]} will match a single decimal digit.
 
+
 GNU extensions are not supported and so @samp{\w}, @samp{\W}, @samp{\<}, @samp{\>}, @samp{\b}, @samp{\B}, @samp{\`}, and @samp{\'} match @samp{w}, @samp{W}, @samp{<}, @samp{>}, @samp{b}, @samp{B}, @samp{`}, and @samp{'} respe

Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-07-08 Thread Collin Funk
Hi Paul,

Trimming CC and adding Bernhard and [email protected] in case anyone
else wants to add input.

Paul Eggert  writes:

> Thanks; I installed the attached somewhat fancier patch into Gnulib.
>
> From efd5c380ff8062541d5fd98b050ecd3cb295917c Mon Sep 17 00:00:00 2001
> From: Paul Eggert 
> Date: Sun, 13 Apr 2025 18:01:08 -0700
> Subject: [PATCH] regex: match current Emacs behavior
>
> * config/srclist.txt: Comment out regex.h, since we now
> disagree with glibc.
> * lib/regex.h (RE_SYNTAX_EMACS):
> Match Emacs 21+ behavior, not Emacs 20-.
> * m4/regex.m4 (gl_REGEX): Check for this Emacs fix.
> ---
>  ChangeLog  | 9 +
>  config/srclist.txt | 2 +-
>  doc/regex.texi | 3 ++-
>  lib/regex.h| 8 
>  m4/regex.m4| 6 +-
>  5 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/ChangeLog b/ChangeLog
> index 2ea548d13f..c20c151757 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,3 +1,12 @@
> +2025-04-13  Paul Eggert  
> +
> + regex: match current Emacs behavior
> + * config/srclist.txt: Comment out regex.h, since we now
> + disagree with glibc.
> + * lib/regex.h (RE_SYNTAX_EMACS):
> + Match Emacs 21+ behavior, not Emacs 20-.
> + * m4/regex.m4 (gl_REGEX): Check for this Emacs fix.
> +
>  2025-04-13  Bruno Haible  
>  
>   getlogin_r tests: Avoid writing to a literal string.
> diff --git a/config/srclist.txt b/config/srclist.txt
> index 173f23edaf..62816dcf4a 100644
> --- a/config/srclist.txt
> +++ b/config/srclist.txt
> @@ -68,7 +68,7 @@ $LIBCSRC malloc/scratch_buffer_set_array_size.c lib/malloc
>  #$LIBCSRC misc/sys/cdefs.h   lib
>  #$LIBCSRC posix/regcomp.clib
>  $LIBCSRC posix/regex.c   lib
> -$LIBCSRC posix/regex.h   lib
> +#$LIBCSRC posix/regex.h  lib
>  #$LIBCSRC posix/regex_internal.c lib
>  #$LIBCSRC posix/regex_internal.h lib
>  #$LIBCSRC posix/regexec.clib
> diff --git a/doc/regex.texi b/doc/regex.texi
> index cba1e13520..925b0db639 100644
> --- a/doc/regex.texi
> +++ b/doc/regex.texi
> @@ -316,7 +316,8 @@ regular expressions.
>  The predefined syntaxes---taken directly from @file{regex.h}---are:
>  
>  @smallexample
> -#define RE_SYNTAX_EMACS 0
> +# define RE_SYNTAX_EMACS\
> +  (RE_CHAR_CLASSES | RE_INTERVALS)
>  
>  #define RE_SYNTAX_AWK   \
>(RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL   \
> diff --git a/lib/regex.h b/lib/regex.h
> index 67a3aa70a5..c4c6089a8c 100644
> --- a/lib/regex.h
> +++ b/lib/regex.h
> @@ -66,9 +66,8 @@ typedef unsigned long int active_reg_t;
>  
>  /* The following bits are used to determine the regexp syntax we
> recognize.  The set/not-set meanings are chosen so that Emacs syntax
> -   remains the value 0.  The bits are given in alphabetical order, and
> -   the definitions shifted by one from the previous bit; thus, when we
> -   add or remove a bit, only one other definition need change.  */
> +   is the value 0 for Emacs 20 (2000) and earlier, and the value
> +   RE_SYNTAX_EMACS for Emacs 21 (2001) and later.  */
>  typedef unsigned long int reg_syntax_t;
>  
>  #ifdef __USE_GNU
> @@ -215,7 +214,8 @@ extern reg_syntax_t re_syntax_options;
> (The [[[ comments delimit what gets put into the Texinfo file, so
> don't delete them!)  */
>  /* [[[begin syntaxes]]] */
> -# define RE_SYNTAX_EMACS 0
> +# define RE_SYNTAX_EMACS \
> +  (RE_CHAR_CLASSES | RE_INTERVALS)
>  
>  # define RE_SYNTAX_AWK \
>(RE_BACKSLASH_ESCAPE_IN_LISTS   | RE_DOT_NOT_NULL  \
> diff --git a/m4/regex.m4 b/m4/regex.m4
> index 80dfb8e1e5..1b2012fe00 100644
> --- a/m4/regex.m4
> +++ b/m4/regex.m4
> @@ -1,5 +1,5 @@
>  # regex.m4
> -# serial 78
> +# serial 79
>  dnl Copyright (C) 1996-2001, 2003-2025 Free Software Foundation, Inc.
>  dnl This file is free software; the Free Software Foundation
>  dnl gives unlimited permission to copy and/or distribute it,
> @@ -53,6 +53,10 @@ AC_DEFUN([gl_REGEX],
>  /* Exit with distinguishable exit code.  */
>  static void sigabrt_no_core (int sig) { raise (SIGTERM); }
>  #endif
> +
> +#if RE_SYNTAX_EMACS != (RE_CHAR_CLASSES | RE_INTERVALS)
> +# error "RE_SYNTAX_EMACS does not match Emacs behavior"
> +#endif
>]],
>[[int result = 0;
>  static struct re_pattern_buffer regex;

Bernhard, I built findutils with GNULIB_SRCDIR set to my local
clone. This uses the latest Gnulib commit instead of the one specified
by the submodule.

This patch causes the following 'make check' failure in findutils:

./../doc/regexprops.texi /tmp/check-regexprops.wUz52k differ: char 1649, line 45
./../doc/regexprops.texi is out of date.
Upda

Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-16 Thread Eric Blake
On Wed, Apr 16, 2025 at 05:45:46PM +0200, Bruno Haible wrote:
> Eric Blake wrote:
> > If you want a quicker table, I can attempt to provide one (note that
> > POSIX BRE do not actually have to support \+ \? or \|; but glibc and
> > gnulib's implementation does):
> > 
> >    feature      0-or-more  1-or-more  1-or-0  grouping/alternation  intervals  charclasses
> > syntax
> > 0 (old emacs)   *          +          ?       \( \| \)              n/a        n/a
> > emacs           *          +          ?       \( \| \)              \{ \}      [[:...:]]
> > posix-basic     *          \+         \?      \( \| \)              \{ \}      [[:...:]]
> > posix-extended  *          +          ?       (  |  )               {  }       [[:...:]]
> 
> This is a very nice table. How about adding it to the Gnulib documentation?
> 
> 
> I tried to reduce the width by transposing the table, but it is not
> as pretty as in the way you wrote it:
> 
>               syntax  0 (old emacs)  emacs      posix-basic  posix-extended
> feature
> 0-or-more             *              *          *            *
> 1-or-more             +              +          \+           +
> 1-or-0                ?              ?          \?           ?
> grouping/alternation  \( \| \)       \( \| \)   \( \| \)     ( | )
> intervals             n/a            \{ \}      \{ \}        { }
> charclasses           n/a            [[:...:]]  [[:...:]]    [[:...:]]

It may also be worth listing other flavors: regex.h also documents
gnu-awk, posix-awk, grep, and egrep flavors (at which point you have
to weigh the number of rows against the number of columns to decide
whether the transposition makes sense).  Admittedly, these other flavors tend to
be refinements of BRE and ERE, where the features of this table are
still present, but where other knobs are fine-tuned (for example,
whether . matches newline or NUL, whether ^ forms an anchor inside of
a group, whether a mis-spelled interval silently parses successfully
as a literal or causes a failure to compile, and so forth) - at which
point we could also list more features in the table.

One other problem: recent POSIX has added support for non-greedy
repetition to ERE; it looks like neither glibc nor gnulib's regex has
implemented that yet, but it is an awesome feature (although the POSIX
wording is still in flux:
https://www.austingroupbugs.net/view.php?id=1857).  "For example, the
ERE "([ab]{6}|a)*?b" matches the first five characters of the string
"" as this is the shortest for the minimal repetition
"*?". Matching with the least repetitions would match the first seven
characters by using one repetition of "[ab]{6}" instead of four
repetitions of "a". This distinction is only possible because the
alternatives in an ERE alternation are chosen according to which gives
the longest (or shortest) match."

On a system with glibc 2.39 and grep 3.11, I can demonstrate that we
don't yet implement the new POSIX requirement on non-greedy operators,
since:

$ echo  | grep --color -E '([ab]{6}|a)*?b'


colors all 8 bytes rather than just the first 5, as POSIX wants.  It's
also easy to see that REG_MINIMAL is not yet defined in regex.h, which
is essential to the implementation of ??, *?, +?, and {...}?
non-greedy ERE operators (when REG_MINIMAL is specified to regcomp(),
all repetitions without a ? suffix become minimal and the presence of
the trailing ? swaps to maximal).  So if nothing else, gnulib's
doc/posix-functions/regcomp.texi should probably mention the newer
POSIX features that we don't implement yet.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-16 Thread Bruno Haible via Gnulib discussion list
Eric Blake wrote:
> If you want a quicker table, I can attempt to provide one (note that
> POSIX BRE do not actually have to support \+ \? or \|; but glibc and
> gnulib's implementation does):
> 
>    feature      0-or-more  1-or-more  1-or-0  grouping/alternation  intervals  charclasses
> syntax
> 0 (old emacs)   *          +          ?       \( \| \)              n/a        n/a
> emacs           *          +          ?       \( \| \)              \{ \}      [[:...:]]
> posix-basic     *          \+         \?      \( \| \)              \{ \}      [[:...:]]
> posix-extended  *          +          ?       (  |  )               {  }       [[:...:]]

This is a very nice table. How about adding it to the Gnulib documentation?


I tried to reduce the width by transposing the table, but it is not
as pretty as in the way you wrote it:

              syntax  0 (old emacs)  emacs      posix-basic  posix-extended
feature
0-or-more             *              *          *            *
1-or-more             +              +          \+           +
1-or-0                ?              ?          \?           ?
grouping/alternation  \( \| \)       \( \| \)   \( \| \)     ( | )
intervals             n/a            \{ \}      \{ \}        { }
charclasses           n/a            [[:...:]]  [[:...:]]    [[:...:]]


Bruno






Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-16 Thread Nikolaos Chatzikonstantinou
On Wed, Apr 16, 2025, 10:22 AM Eric Blake  wrote:

> On Tue, Apr 15, 2025 at 06:34:35PM -0400, Nikolaos Chatzikonstantinou
> wrote:
> > > Since there's already another long thread on how m4 does not match
> > > current emacs regex but why enabling intervals would break at least
> > > autoconf 2.72, I'm inclined to update the m4 manual rather than use
> > > RE_SYNTAX_EMACS, whether or not this patch is accepted.
> >
> > I'm having a bit of issue following, but this is relevant to me, so
> > I'd like to ask the following questions:
> >
> > 1)  has two interfaces, the old glibc one that gnulib
> > implements and the POSIX one with regcomp() and regexec(). What I've
> > noticed is inconsistency between the two interfaces in syntax:
> >
> > # m4 regexp that matches:
> > regexp(`foo', `[a-z]+')
> >
> > This will not match with POSIX:
> >
> > regcomp(&re, "[a-z]+", 0);
> > assert(regexec(&re, "foo", 0, NULL, 0) == REG_NOMATCH);
> >
> > The reason is that POSIX BRE wants [a-z]\+ instead. So the question
> > is, does this mean the two interfaces have incompatible syntaxes?
>
> m4 uses re_compile_pattern() with syntax 0 (which at one point used to
> be RE_SYNTAX_EMACS, but this thread shows that is no longer the case).
> regcomp() is a POSIX interface, but it basically forces a syntax of
> either RE_SYNTAX_POSIX_EXTENDED or RE_SYNTAX_POSIX_BASIC.
>
> The re_compile_pattern() interface is superior: it offers greater
> flexibility to the user, and is a superset of the regcomp() interface
> (which can only choose between two syntax levels, rather than the
> wider range of re_compile_pattern() syntaxes and individual feature
> knobs).
>
> > I don't think that's clarified in either the glibc manual
> > <https://www.gnu.org/software/libc/manual/html_node/Regular-Expressions.html>
> > and gnulib's
> > <https://www.gnu.org/software/gnulib/manual/html_node/The-Backslash-Character.html>.
> > Perhaps gnulib should be agnostic of this issue (although worth a mention?)
> > but certainly glibc should mention it.
>
> Gnulib does have a way to list ALL of the regex flavors; the
> regexprops-generic module creates doc/regexprops-generic.texi as a
> drop-in chapter to any larger project's manual that exposes the choice
> of syntax to the end user.  And GNU findutils does just that (you can
> use 'find --regextype=...' with 'emacs', 'posix-awk', 'posix-basic',
> 'posix-egrep', 'posix-extended'):
>
> https://www.gnu.org/software/findutils/manual/html_mono/find.html#Regular-Expressions
>
> This thread deals with the fact that 'emacs' syntax has changed over
> the years (prior to 2001, it did not have intervals or character
> classes; nowadays emacs has those but programs using syntax 0 like m4
> do not).
>
> And one thought is that a future m4 may also expose the ability to
> choose syntax from this same set.
>
> Meanwhile, I have already patched the upcoming GNU m4 1.4.20 manual to
> be a bit more specific about the syntax it does support, without
> changing the syntax (1.4.x should remain backwards-compatible; any
> changes to syntax or the ability to let the user control syntaxes
> rather than a single syntax being hard-coded would be new to 1.6 or
> 2.0).
> https://git.sv.gnu.org/cgit/m4.git/commit/?h=branch-1.4&id=c8a6346c
>
> >
> > 2) Is there going to be a change planned in either gnulib, glibc, or
> > m4 in terms of the regex syntax? If m4 breaks backwards compatibility,
> > how will all the m4 scripts be fixed? Isn't it nontrivial?
>
> The current discussion is on fixing gnulib so that 'emacs' syntax and
> syntax 0 are no longer synonymous (ie., make 'emacs' syntax actually
> match what emacs has done since 2001); this fix is currently
> independent of glibc, although glibc will likely be changed soon and
> gnulib go back to mirroring glibc.
>
> Changing m4 syntax is not trivial.  That's why m4 1.4.20 will still be
> syntax 0 (no change), but will attempt to document the situation
> better.  I'm struggling to even figure out how to make m4 make it easy
> to diagnose scripts that use \{ non-portably, so that it becomes
> possible to opt-in to warnings about a regex that may compile
> differently in the future (alas, m4's debugmode() builtin macro is not
> yet easily extensible, and changing that also risks
> backwards-compatibility headaches).
>
> >
> > 3) What syntax does m4 follow after all? Should it be called the Emacs
> > syntax or will that passage be changed from the manual?
>
> That passage will be changed for 1.4.20 (see above).  It is the
> pre-2001 emacs syntax, aka syntax 0.
>
> If you want a quicker table, I can attempt to provide one (note that
> POSIX BRE do not actually have to support \+ \? or \|; but glibc and
> gnulib's implementation does):
>
>feature   0-or-more 1-or-more 1-or-0 grouping/alternation intervals
> charclasses
> syntax
> 0 (old emacs) *   +?   \( \| \)n/a  n/a
> emacs *   +?   \( \| \)

Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-16 Thread Eric Blake
On Tue, Apr 15, 2025 at 06:34:35PM -0400, Nikolaos Chatzikonstantinou wrote:
> > Since there's already another long thread on how m4 does not match
> > current emacs regex but why enabling intervals would break at least
> > autoconf 2.72, I'm inclined to update the m4 manual rather than use
> > RE_SYNTAX_EMACS, whether or not this patch is accepted.
> 
> I'm having a bit of issue following, but this is relevant to me, so
> I'd like to ask the following questions:
> 
> 1)  has two interfaces, the old glibc one that gnulib
> implements and the POSIX one with regcomp() and regexec(). What I've
> noticed is inconsistency between the two interfaces in syntax:
> 
> # m4 regexp that matches:
> regexp(`foo', `[a-z]+')
> 
> This will not match with POSIX:
> 
> regcomp(&re, "[a-z]+", 0);
> assert(regexec(&re, "foo", 0, NULL, 0) == REG_NOMATCH);
> 
> The reason is that POSIX BRE wants [a-z]\+ instead. So the question
> is, does this mean the two interfaces have incompatible syntaxes?

m4 uses re_compile_pattern() with syntax 0 (which at one point used to
be RE_SYNTAX_EMACS, but this thread shows that is no longer the case).
regcomp() is a POSIX interface, but it basically forces a syntax of
either RE_SYNTAX_POSIX_EXTENDED or RE_SYNTAX_POSIX_BASIC.

The re_compile_pattern() interface is superior: it offers greater
flexibility to the user, and is a superset of the regcomp() interface
(which can only choose between two syntax levels, rather than the
wider range of re_compile_pattern() syntaxes and individual feature
knobs).

> I
> don't think that's clarified in either the glibc manual
> 
> and gnulib's
> .
> Perhaps
> gnulib should be agnostic of this issue (although worth a mention?)
> but certainly glibc should mention it.

Gnulib does have a way to list ALL of the regex flavors; the
regexprops-generic module creates doc/regexprops-generic.texi as a
drop-in chapter to any larger project's manual that exposes the choice
of syntax to the end user.  And GNU findutils does just that (you can
use 'find --regextype=...' with 'emacs', 'posix-awk', 'posix-basic',
'posix-egrep', 'posix-extended'):
https://www.gnu.org/software/findutils/manual/html_mono/find.html#Regular-Expressions

This thread deals with the fact that 'emacs' syntax has changed over
the years (prior to 2001, it did not have intervals or character
classes; nowadays emacs has those but programs using syntax 0 like m4
do not).

And one thought is that a future m4 may also expose the ability to
choose syntax from this same set.

Meanwhile, I have already patched the upcoming GNU m4 1.4.20 manual to
be a bit more specific about the syntax it does support, without
changing the syntax (1.4.x should remain backwards-compatible; any
changes to syntax or the ability to let the user control syntaxes
rather than a single syntax being hard-coded would be new to 1.6 or
2.0).
https://git.sv.gnu.org/cgit/m4.git/commit/?h=branch-1.4&id=c8a6346c

> 
> 2) Is there going to be a change planned in either gnulib, glibc, or
> m4 in terms of the regex syntax? If m4 breaks backwards compatibility,
> how will all the m4 scripts be fixed? Isn't it nontrivial?

The current discussion is on fixing gnulib so that 'emacs' syntax and
syntax 0 are no longer synonymous (ie., make 'emacs' syntax actually
match what emacs has done since 2001); this fix is currently
independent of glibc, although glibc will likely be changed soon and
gnulib go back to mirroring glibc.

Changing m4 syntax is not trivial.  That's why m4 1.4.20 will still be
syntax 0 (no change), but will attempt to document the situation
better.  I'm struggling to even figure out how to make m4 make it easy
to diagnose scripts that use \{ non-portably, so that it becomes
possible to opt-in to warnings about a regex that may compile
differently in the future (alas, m4's debugmode() builtin macro is not
yet easily extensible, and changing that also risks
backwards-compatibility headaches).

> 
> 3) What syntax does m4 follow after all? Should it be called the Emacs
> syntax or will that passage be changed from the manual?

That passage will be changed for 1.4.20 (see above).  It is the
pre-2001 emacs syntax, aka syntax 0.

If you want a quicker table, I can attempt to provide one (note that
POSIX BRE do not actually have to support \+ \? or \|; but glibc and
gnulib's implementation does):

   feature      0-or-more  1-or-more  1-or-0  grouping/alternation  intervals  charclasses
syntax
0 (old emacs)   *          +          ?       \( \| \)              n/a        n/a
emacs           *          +          ?       \( \| \)              \{ \}      [[:...:]]
posix-basic     *          \+         \?      \( \| \)              \{ \}      [[:...:]]
posix-extended  *          +          ?       (  |  )               {  }       [[:...:]]

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualiz

Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-15 Thread Bruno Haible via Gnulib discussion list
Eric Blake wrote:
> > +/* There is no need to check whether RE_SYNTAX_EMACS is
> > +   (RE_CHAR_CLASSES | RE_INTERVALS), corresponding to
> > +   Emacs 21 (2001) and later, because Gnulib's lib/regex.h
> > +   is always used and has this value.  */
> 
> Or are we really guaranteed that even when we use gnulib's regex
> module, we are using gnulib's regex.h but can still use the unreplaced
> glibc implementations?

Yes, we have this guarantee (unless the maintainer has forgotten to add the
appropriate -I options to their Makefile.am, but that would most likely cause
trouble elsewhere).

Bruno






Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-15 Thread Eric Blake
On Mon, Apr 14, 2025 at 02:47:06PM -0700, Paul Eggert wrote:
> +++ b/doc/posix-headers/regex.texi
> @@ -11,6 +11,19 @@ regex.h
>  @item
>  This header file is missing on some platforms:
>  mingw, MSVC 14.
> +
> +@item
> +On some systems that have this header file,
> +GNU extensions like @code{re_set_syntax} and @code{RE_SYNTAX_EMACS}
> +are not declared or supported:
> +FreeBSD 14.2, OpenBSD 7.6, NetBSD 10.1, macOS 15,
> +Minix 3.3.0, AIX 7.3, HP-UX 11, Solaris 11.4.
> +
> +@item
> +On some systems that support GNU extensions, @code{RE_SYNTAX_EMACS} is 0
> +even though it should be @code{(RE_CHAR_CLASSES | RE_INTERVALS)}
> +to be compatible with Emacs 21 (2001) and later:
> +glibc 2.41, Cygwin 2.6.x.
>  @end itemize
>  
>  Portability problems not fixed by Gnulib:

Should the paragraph on RE_SYNTAX_EMACS being 0 on some platforms be
listed in the section of portability problems NOT fixed by Gnulib,
since we are now intentionally NOT overriding glibc merely because of
that difference?

> +++ b/m4/regex.m4
> @@ -1,5 +1,5 @@
>  # regex.m4
> -# serial 80
> +# serial 81
>  dnl Copyright (C) 1996-2001, 2003-2025 Free Software Foundation, Inc.
>  dnl This file is free software; the Free Software Foundation
>  dnl gives unlimited permission to copy and/or distribute it,
> @@ -54,9 +54,10 @@ AC_DEFUN([gl_REGEX]
>  static void sigabrt_no_core (int sig) { raise (SIGTERM); }
>  #endif
>  
> -#if !RE_SYNTAX_EMACS
> -# error "RE_SYNTAX_EMACS does not match Emacs behavior"
> -#endif
> +/* There is no need to check whether RE_SYNTAX_EMACS is
> +   (RE_CHAR_CLASSES | RE_INTERVALS), corresponding to
> +   Emacs 21 (2001) and later, because Gnulib's lib/regex.h
> +   is always used and has this value.  */

Or are we really guaranteed that even when we use gnulib's regex
module, we are using gnulib's regex.h but can still use the unreplaced
glibc implementations?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-14 Thread Paul Eggert

On 4/14/25 07:30, Bruno Haible wrote:

   - Application code sees the gnulib regex.h (due to -I options).
 Things would be different if the 'regex' module would be using a
 regex.in.h from which a regex.h is conditionally generated. But
 the way things are, gnulib regex.h is used unconditionally.

   - Nothing in the reg*.c files uses RE_SYNTAX_EMACS.


Thanks, good catch. I installed into Gnulib the attached patch, which 
implements your suggestion and also adds some documentation about this 
issue.

From 6ff0345dd56403f8a7953602b10c914f81f80d20 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Mon, 14 Apr 2025 14:43:02 -0700
Subject: [PATCH] regex: don’t check RE_SYNTAX_EMACS
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* m4/regex.m4 (gl_REGEX): Do not check RE_SYNTAX_EMACS’s value.
Suggested by Bruno Haible in:
https://lists.gnu.org/r/bug-gnulib/2025-04/msg00098.html
---
 ChangeLog|  7 +++
 doc/posix-headers/regex.texi | 13 +
 m4/regex.m4  |  9 +
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 634f5b089e..1d4a399379 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2025-04-14  Paul Eggert  
+
+	regex: don’t check RE_SYNTAX_EMACS
+	* m4/regex.m4 (gl_REGEX): Do not check RE_SYNTAX_EMACS’s value.
+	Suggested by Bruno Haible in:
+	https://lists.gnu.org/r/bug-gnulib/2025-04/msg00098.html
+
 2025-04-14  Bruno Haible  
 
 	c32is*, c32to* tests: Avoid test failures on macOS 15.4.
diff --git a/doc/posix-headers/regex.texi b/doc/posix-headers/regex.texi
index b54c243060..403088c6fc 100644
--- a/doc/posix-headers/regex.texi
+++ b/doc/posix-headers/regex.texi
@@ -11,6 +11,19 @@ regex.h
 @item
 This header file is missing on some platforms:
 mingw, MSVC 14.
+
+@item
+On some systems that have this header file,
+GNU extensions like @code{re_set_syntax} and @code{RE_SYNTAX_EMACS}
+are not declared or supported:
+FreeBSD 14.2, OpenBSD 7.6, NetBSD 10.1, macOS 15,
+Minix 3.3.0, AIX 7.3, HP-UX 11, Solaris 11.4.
+
+@item
+On some systems that support GNU extensions, @code{RE_SYNTAX_EMACS} is 0
+even though it should be @code{(RE_CHAR_CLASSES | RE_INTERVALS)}
+to be compatible with Emacs 21 (2001) and later:
+glibc 2.41, Cygwin 2.6.x.
 @end itemize
 
 Portability problems not fixed by Gnulib:
diff --git a/m4/regex.m4 b/m4/regex.m4
index 52ce5c3b37..49a8059f61 100644
--- a/m4/regex.m4
+++ b/m4/regex.m4
@@ -1,5 +1,5 @@
 # regex.m4
-# serial 80
+# serial 81
 dnl Copyright (C) 1996-2001, 2003-2025 Free Software Foundation, Inc.
 dnl This file is free software; the Free Software Foundation
 dnl gives unlimited permission to copy and/or distribute it,
@@ -54,9 +54,10 @@ AC_DEFUN([gl_REGEX]
 static void sigabrt_no_core (int sig) { raise (SIGTERM); }
 #endif
 
-#if !RE_SYNTAX_EMACS
-# error "RE_SYNTAX_EMACS does not match Emacs behavior"
-#endif
+/* There is no need to check whether RE_SYNTAX_EMACS is
+   (RE_CHAR_CLASSES | RE_INTERVALS), corresponding to
+   Emacs 21 (2001) and later, because Gnulib's lib/regex.h
+   is always used and has this value.  */
   ]],
   [[int result = 0;
 static struct re_pattern_buffer regex;
-- 
2.49.0



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-14 Thread Bruno Haible via Gnulib discussion list
Hi Paul,

Paul Eggert wrote:
> > Thanks; I installed the attached somewhat fancier patch into Gnulib.
> 
> ... and then installed the attached further patch to fix a thinko in the 
> previously-mentioned one.

These changes have the effect that even on glibc systems, gnulib's regex
code is now used instead of glibc's. This is what happens at configure
time:

configure:29480: checking for working re_compile_pattern
configure:29802: clang 
-fsanitize=address,undefined,signed-integer-overflow,shift,integer-divide-by-zero
 -fno-sanitize-recover=undefined -o conftest -O0 -fno-omit-frame-pointer -ggdb 
-Wall -Wthread-safety conftest.c  >&5
conftest.c:207:15: error: "RE_SYNTAX_EMACS does not match Emacs behavior"
  207 | # error "RE_SYNTAX_EMACS does not match Emacs behavior"
  |   ^

And, in fact, this is not necessary, because

  - Application code sees the gnulib regex.h (due to -I options).
    Things would be different if the 'regex' module used a
    regex.in.h from which a regex.h is conditionally generated. But
    the way things are, gnulib regex.h is used unconditionally.

  - Nothing in the reg*.c files uses RE_SYNTAX_EMACS.

It is just added baggage for applications.

I therefore suggest reverting the changes to m4/regex.m4.

Bruno






Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-13 Thread Bruno Haible via Gnulib discussion list
Paul Eggert wrote:
> >>- Will they change RE_BACKSLASH_ESCAPE_IN_LISTS from
> >>((unsigned long int) 1)
> >>  to
> >>(1UL)
> >>  in order to make the RE_* values usable in preprocessor expressions?
> > 
> > No; I followed up with a later patch to remove that assumption.
> 
> Oops, sorry, I misunderstood your comment. You're right, we need a patch 
> to do that. I just installed the attached.

You installed this patch in Gnulib, but not in glibc. If we don't want
to make assumptions about whether glibc will make the same change, how about
this patch?

diff --git a/m4/regex.m4 b/m4/regex.m4
index 52ce5c3b37..508352ab43 100644
--- a/m4/regex.m4
+++ b/m4/regex.m4
@@ -54,9 +54,10 @@ AC_DEFUN([gl_REGEX]
 static void sigabrt_no_core (int sig) { raise (SIGTERM); }
 #endif
 
-#if !RE_SYNTAX_EMACS
-# error "RE_SYNTAX_EMACS does not match Emacs behavior"
-#endif
+/* Check whether RE_SYNTAX_EMACS matches Emacs behavior.
+   Note that in many glibc versions, the RE_* macros cannot
+   be used in preprocessor expressions.  */
+typedef int emacs_check[RE_SYNTAX_EMACS ? 1 : -1];
   ]],
   [[int result = 0;
 static struct re_pattern_buffer regex;
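
The typedef in the patch above relies on a common C idiom: an array type
with a negative size is a constraint violation, so the declaration only
compiles when the condition is nonzero. Unlike #if, this works even when
the macro expands to an expression containing a cast. A minimal sketch,
assuming hypothetical DEMO_* macros and helper functions that merely
imitate the shape of the regex.h definitions (they are not the real ones):

```c
#include <stddef.h>

/* Hypothetical stand-ins for RE_* bits, written with a cast the way
   many glibc versions write them.  Not the real regex.h definitions.  */
#define DEMO_CHAR_CLASSES ((unsigned long int) 1 << 2)
#define DEMO_INTERVALS    ((unsigned long int) 1 << 9)
#define DEMO_SYNTAX_EMACS (DEMO_CHAR_CLASSES | DEMO_INTERVALS)

/* Compile-time check: the array size is 1 if DEMO_SYNTAX_EMACS is
   nonzero, and -1 (which fails to compile) if it is zero.  */
typedef int demo_emacs_check[DEMO_SYNTAX_EMACS ? 1 : -1];

/* Hypothetical helpers so the result can be inspected at run time.  */
size_t
demo_check_size (void)
{
  return sizeof (demo_emacs_check) / sizeof (int);
}

unsigned long int
demo_syntax_value (void)
{
  return DEMO_SYNTAX_EMACS;
}
```

If DEMO_SYNTAX_EMACS were defined as 0, the typedef line would be rejected
by the compiler, which is the failure mode a configure test can detect.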






Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-13 Thread Paul Eggert

On 2025-04-13 20:39, Paul Eggert wrote:


   - Will they change RE_BACKSLASH_ESCAPE_IN_LISTS from
   ((unsigned long int) 1)
 to
   (1UL)
 in order to make the RE_* values usable in preprocessor expressions?


No; I followed up with a later patch to remove that assumption.


Oops, sorry, I misunderstood your comment. You're right, we need a patch 
to do that. I just installed the attached.

From 23c09f69c9d6e43d614934c4de94c294a811fdf0 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 13 Apr 2025 20:47:54 -0700
Subject: [PATCH] regex: make RE_* usable in #if

* lib/regex.h (RE_BACKSLASH_ESCAPE_IN_LISTS):
Define to 1ul so that the RE_* macros can be used in #if.
---
 ChangeLog   | 2 ++
 lib/regex.h | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/ChangeLog b/ChangeLog
index e202b14530..210be85262 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -3,6 +3,8 @@
 	regex: pacify gcc -Wcalloc-transposed-args
 	* lib/regcomp.c (init_dfa, parse_bracket_exp)
 	(build_charclass_op):
+	* lib/regex.h (RE_BACKSLASH_ESCAPE_IN_LISTS):
+	Define to 1ul so that the RE_* macros can be used in #if.
 	* lib/regex_internal.c (create_ci_newstate, create_cd_newstate):
 	* lib/regexec.c (get_subexp, build_trtable):
 	When calling calloc, put size argument last.
diff --git a/lib/regex.h b/lib/regex.h
index c4c6089a8c..ff7e43b534 100644
--- a/lib/regex.h
+++ b/lib/regex.h
@@ -73,7 +73,7 @@ typedef unsigned long int reg_syntax_t;
 #ifdef __USE_GNU
 /* If this bit is not set, then \ inside a bracket expression is literal.
If set, then such a \ quotes the following character.  */
-# define RE_BACKSLASH_ESCAPE_IN_LISTS ((unsigned long int) 1)
+# define RE_BACKSLASH_ESCAPE_IN_LISTS 1ul
 
 /* If this bit is not set, then + and ? are operators, and \+ and \? are
  literals.
-- 
2.45.2
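
For readers wondering why the cast matters: inside #if, the preprocessor
replaces every remaining identifier, keywords included, with 0, so
((unsigned long int) 1) turns into ((0 0 0) 1), which is a syntax error.
A plain 1ul literal is a valid preprocessor constant. The sketch below
uses hypothetical EX_* macros that only imitate the shift-by-one layout
of the RE_* bits; the names and the simplified EX_SYNTAX value are
assumptions made for illustration:

```c
/* Hypothetical stand-ins for the first few RE_* bits.  The first bit is
   a plain literal, so every macro derived from it works in #if.  */
#define EX_BACKSLASH_ESCAPE_IN_LISTS 1ul
#define EX_BK_PLUS_QM   (EX_BACKSLASH_ESCAPE_IN_LISTS << 1)
#define EX_CHAR_CLASSES (EX_BK_PLUS_QM << 1)

/* Simplified syntax value for the demonstration.  */
#define EX_SYNTAX EX_CHAR_CLASSES

/* Both checks are evaluated by the preprocessor.  Had the first macro
   been ((unsigned long int) 1), these lines would not even parse.  */
#if !EX_SYNTAX
# error "EX_SYNTAX is unexpectedly zero"
#endif
#if EX_SYNTAX != EX_CHAR_CLASSES
# error "EX_SYNTAX has an unexpected value"
#endif

/* Hypothetical helper so the value can be inspected at run time.  */
unsigned long int
ex_char_classes (void)
{
  return EX_CHAR_CLASSES;
}
```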



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-13 Thread Paul Eggert

On 2025-04-13 19:28, Bruno Haible wrote:

Paul Eggert wrote:

Thanks; I installed the attached somewhat fancier patch into Gnulib.


... and then installed the attached further patch to fix a thinko in the
previously-mentioned one.


I don't understand this patch. What changes are you expecting to come
on the glibc side?

   - Will they change RE_BACKSLASH_ESCAPE_IN_LISTS from
   ((unsigned long int) 1)
 to
   (1UL)
 in order to make the RE_* values usable in preprocessor expressions?


No; I followed up with a later patch to remove that assumption.


   - Will they change the value of RE_SYNTAX_EMACS? Or can't they do this,
 because that would break backward compatibility?


The idea is to change the value of RE_SYNTAX_EMACS, yes.



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-13 Thread Bruno Haible via Gnulib discussion list
Paul Eggert wrote:
> > Thanks; I installed the attached somewhat fancier patch into Gnulib.
> 
> ... and then installed the attached further patch to fix a thinko in the 
> previously-mentioned one.

I don't understand this patch. What changes are you expecting to come
on the glibc side?

  - Will they change RE_BACKSLASH_ESCAPE_IN_LISTS from
  ((unsigned long int) 1)
to
  (1UL)
in order to make the RE_* values usable in preprocessor expressions?

  - Will they change the value of RE_SYNTAX_EMACS? Or can't they do this,
because that would break backward compatibility?

Bruno






Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-13 Thread Paul Eggert

On 2025-04-13 18:04, Paul Eggert wrote:

Thanks; I installed the attached somewhat fancier patch into Gnulib.


... and then installed the attached further patch to fix a thinko in the 
previously-mentioned one.

From f6d648a883676256894f5687cbceffb1f4209e3d Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 13 Apr 2025 18:53:17 -0700
Subject: [PATCH] regex: don’t assume RE_SYNTAX_* work in #if
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* m4/regex.m4 (gl_REGEX): Fix thinko that would have
prevented future glibc versions from passing the test.
---
 m4/regex.m4 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/m4/regex.m4 b/m4/regex.m4
index 1b2012fe00..52ce5c3b37 100644
--- a/m4/regex.m4
+++ b/m4/regex.m4
@@ -1,5 +1,5 @@
 # regex.m4
-# serial 79
+# serial 80
 dnl Copyright (C) 1996-2001, 2003-2025 Free Software Foundation, Inc.
 dnl This file is free software; the Free Software Foundation
 dnl gives unlimited permission to copy and/or distribute it,
@@ -54,7 +54,7 @@ AC_DEFUN([gl_REGEX],
 static void sigabrt_no_core (int sig) { raise (SIGTERM); }
 #endif
 
-#if RE_SYNTAX_EMACS != (RE_CHAR_CLASSES | RE_INTERVALS)
+#if !RE_SYNTAX_EMACS
 # error "RE_SYNTAX_EMACS does not match Emacs behavior"
 #endif
   ]],
-- 
2.45.2



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-13 Thread Paul Eggert

Thanks; I installed the attached somewhat fancier patch into Gnulib.

From efd5c380ff8062541d5fd98b050ecd3cb295917c Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 13 Apr 2025 18:01:08 -0700
Subject: [PATCH] regex: match current Emacs behavior

* config/srclist.txt: Comment out regex.h, since we now
disagree with glibc.
* lib/regex.h (RE_SYNTAX_EMACS):
Match Emacs 21+ behavior, not Emacs 20-.
* m4/regex.m4 (gl_REGEX): Check for this Emacs fix.
---
 ChangeLog  | 9 +
 config/srclist.txt | 2 +-
 doc/regex.texi | 3 ++-
 lib/regex.h| 8 
 m4/regex.m4| 6 +-
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 2ea548d13f..c20c151757 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,12 @@
+2025-04-13  Paul Eggert  
+
+	regex: match current Emacs behavior
+	* config/srclist.txt: Comment out regex.h, since we now
+	disagree with glibc.
+	* lib/regex.h (RE_SYNTAX_EMACS):
+	Match Emacs 21+ behavior, not Emacs 20-.
+	* m4/regex.m4 (gl_REGEX): Check for this Emacs fix.
+
 2025-04-13  Bruno Haible  
 
 	getlogin_r tests: Avoid writing to a literal string.
diff --git a/config/srclist.txt b/config/srclist.txt
index 173f23edaf..62816dcf4a 100644
--- a/config/srclist.txt
+++ b/config/srclist.txt
@@ -68,7 +68,7 @@ $LIBCSRC malloc/scratch_buffer_set_array_size.c	lib/malloc
 #$LIBCSRC misc/sys/cdefs.h		lib
 #$LIBCSRC posix/regcomp.c		lib
 $LIBCSRC posix/regex.c			lib
-$LIBCSRC posix/regex.h			lib
+#$LIBCSRC posix/regex.h			lib
 #$LIBCSRC posix/regex_internal.c	lib
 #$LIBCSRC posix/regex_internal.h	lib
 #$LIBCSRC posix/regexec.c		lib
diff --git a/doc/regex.texi b/doc/regex.texi
index cba1e13520..925b0db639 100644
--- a/doc/regex.texi
+++ b/doc/regex.texi
@@ -316,7 +316,8 @@ regular expressions.
 The predefined syntaxes---taken directly from @file{regex.h}---are:
 
 @smallexample
-#define RE_SYNTAX_EMACS 0
+# define RE_SYNTAX_EMACS\
+  (RE_CHAR_CLASSES | RE_INTERVALS)
 
 #define RE_SYNTAX_AWK   \
   (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL   \
diff --git a/lib/regex.h b/lib/regex.h
index 67a3aa70a5..c4c6089a8c 100644
--- a/lib/regex.h
+++ b/lib/regex.h
@@ -66,9 +66,8 @@ typedef unsigned long int active_reg_t;
 
 /* The following bits are used to determine the regexp syntax we
recognize.  The set/not-set meanings are chosen so that Emacs syntax
-   remains the value 0.  The bits are given in alphabetical order, and
-   the definitions shifted by one from the previous bit; thus, when we
-   add or remove a bit, only one other definition need change.  */
+   is the value 0 for Emacs 20 (2000) and earlier, and the value
+   RE_SYNTAX_EMACS for Emacs 21 (2001) and later.  */
 typedef unsigned long int reg_syntax_t;
 
 #ifdef __USE_GNU
@@ -215,7 +214,8 @@ extern reg_syntax_t re_syntax_options;
(The [[[ comments delimit what gets put into the Texinfo file, so
don't delete them!)  */
 /* [[[begin syntaxes]]] */
-# define RE_SYNTAX_EMACS 0
+# define RE_SYNTAX_EMACS		\
+  (RE_CHAR_CLASSES | RE_INTERVALS)
 
 # define RE_SYNTAX_AWK			\
   (RE_BACKSLASH_ESCAPE_IN_LISTS   | RE_DOT_NOT_NULL			\
diff --git a/m4/regex.m4 b/m4/regex.m4
index 80dfb8e1e5..1b2012fe00 100644
--- a/m4/regex.m4
+++ b/m4/regex.m4
@@ -1,5 +1,5 @@
 # regex.m4
-# serial 78
+# serial 79
 dnl Copyright (C) 1996-2001, 2003-2025 Free Software Foundation, Inc.
 dnl This file is free software; the Free Software Foundation
 dnl gives unlimited permission to copy and/or distribute it,
@@ -53,6 +53,10 @@ AC_DEFUN([gl_REGEX],
 /* Exit with distinguishable exit code.  */
 static void sigabrt_no_core (int sig) { raise (SIGTERM); }
 #endif
+
+#if RE_SYNTAX_EMACS != (RE_CHAR_CLASSES | RE_INTERVALS)
+# error "RE_SYNTAX_EMACS does not match Emacs behavior"
+#endif
   ]],
   [[int result = 0;
 static struct re_pattern_buffer regex;
-- 
2.45.2



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-13 Thread Vladimir Gorsunov
Updated the comment. The Emacs revision in which RE_CHAR_CLASSES was enabled 
is d24873d4, so syntax 0 must have been used for all releases before that.


On 4/11/25 22:04, Eric Blake wrote:

On Fri, Apr 11, 2025 at 04:52:59PM +0300, Vladimir Gorsunov wrote:

   When GNU Emacs switched to using gnulib for regular expression
   functionality in the etags program, some features stopped working
   (please see https://debbugs.gnu.org/cgi/bugreport.cgi?bug=76945 for
   details). That is because the RE_SYNTAX_EMACS flag combination in gnulib
   doesn't have the corresponding flags set. This value should be updated
   to fix etags and to better reflect the set of features GNU Emacs is
   using at the moment.
 From 76f937ae2eacb3649117e7f4c05819e82a7c42a9 Mon Sep 17 00:00:00 2001
From: vg 
Date: Fri, 11 Apr 2025 16:28:29 +0300
Subject: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

* lib/regex.h: macro update
* doc/regex.texi: documentation update
---
  doc/regex.texi | 3 ++-
  lib/regex.h| 3 ++-
  2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/regex.texi b/doc/regex.texi
index cba1e13520..9917a418be 100644
--- a/doc/regex.texi
+++ b/doc/regex.texi
@@ -316,7 +316,8 @@ regular expressions.
  The predefined syntaxes---taken directly from @file{regex.h}---are:
  
  @smallexample

-#define RE_SYNTAX_EMACS 0
+# define RE_SYNTAX_EMACS   \
+  (RE_CHAR_CLASSES | RE_INTERVALS)

Hmm.  GNU m4 1.4.19 documents that its regex engine matches emacs -
but that's only because m4 uses syntax 0.  If this change is made in
gnulib, then either the m4 manual needs to be patched to state that it is
similar to emacs except for lacking character classes and intervals,
or we make a non-backwards-compatible change in m4 by actually using
RE_SYNTAX_EMACS instead of 0 for the default syntax.

Since there's already another long thread on how m4 does not match
current emacs regex, and on why enabling intervals would break at least
autoconf 2.72, I'm inclined to update the m4 manual rather than use
RE_SYNTAX_EMACS, whether or not this patch is accepted.

What's more, this patch is incomplete; if you change RE_SYNTAX_EMACS,
then you also need to change this paragraph:

/* The following bits are used to determine the regexp syntax we
recognize.  The set/not-set meanings are chosen so that Emacs syntax
remains the value 0.  The bits are given in alphabetical order, and
the definitions shifted by one from the previous bit; thus, when we
add or remove a bit, only one other definition need change.  */
From 0b7b548c2a547ab84adb0001e7d0629b5b6cb6f8 Mon Sep 17 00:00:00 2001
From: Vladimir Gorsunov 
Date: Sun, 13 Apr 2025 12:18:33 +0300
Subject: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

* lib/regex.h: macro update
* doc/regex.texi: documentation update
---
 doc/regex.texi |  3 ++-
 lib/regex.h| 12 +++-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/doc/regex.texi b/doc/regex.texi
index cba1e13520..9917a418be 100644
--- a/doc/regex.texi
+++ b/doc/regex.texi
@@ -316,7 +316,8 @@ regular expressions.
 The predefined syntaxes---taken directly from @file{regex.h}---are:
 
 @smallexample
-#define RE_SYNTAX_EMACS 0
+# define RE_SYNTAX_EMACS		\
+  (RE_CHAR_CLASSES | RE_INTERVALS)
 
 #define RE_SYNTAX_AWK   \
   (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL   \
diff --git a/lib/regex.h b/lib/regex.h
index 67a3aa70a5..316a8e48fd 100644
--- a/lib/regex.h
+++ b/lib/regex.h
@@ -65,10 +65,11 @@ typedef long int s_reg_t;
 typedef unsigned long int active_reg_t;
 
 /* The following bits are used to determine the regexp syntax we
-   recognize.  The set/not-set meanings are chosen so that Emacs syntax
-   remains the value 0.  The bits are given in alphabetical order, and
-   the definitions shifted by one from the previous bit; thus, when we
-   add or remove a bit, only one other definition need change.  */
+   recognize. The set/not-set meanings are chosen so that the value 0
+   is the syntax used originally by Emacs (pre 21.1, when features
+   started to get added). The bits are given in alphabetical order, and
+   the definitions shifted by one from the previous bit; thus, when
+   we add or remove a bit, only one other definition need change. */
 typedef unsigned long int reg_syntax_t;
 
 #ifdef __USE_GNU
@@ -215,7 +216,8 @@ extern reg_syntax_t re_syntax_options;
(The [[[ comments delimit what gets put into the Texinfo file, so
don't delete them!)  */
 /* [[[begin syntaxes]]] */
-# define RE_SYNTAX_EMACS 0
+# define RE_SYNTAX_EMACS		\
+  (RE_CHAR_CLASSES | RE_INTERVALS)
 
 # define RE_SYNTAX_AWK			\
   (RE_BACKSLASH_ESCAPE_IN_LISTS   | RE_DOT_NOT_NULL			\
-- 
2.31.1
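
To make the effect of the two added bits concrete, here is a minimal
sketch that compiles the same pattern under syntax 0 and under
(RE_CHAR_CLASSES | RE_INTERVALS) using the GNU regex API. It assumes a
system whose <regex.h> declares the GNU extensions (for example glibc);
demo_matches is a hypothetical helper, not part of any library:

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1   /* expose the GNU regex API in glibc's <regex.h> */
#endif
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return 1 if PATTERN, compiled under SYNTAX,
   matches somewhere in TEXT; 0 if it does not (or the search fails
   internally); -1 if the pattern does not compile.  */
int
demo_matches (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  memset (&buf, 0, sizeof buf);   /* let the library allocate everything */

  /* re_set_syntax changes a global setting, so restore it afterwards.  */
  reg_syntax_t saved = re_set_syntax (syntax);
  const char *err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (err != NULL)
    return -1;

  regoff_t len = (regoff_t) strlen (text);
  regoff_t pos = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return pos >= 0;
}
```

Under syntax 0 the pattern "[[:alpha:]]" is read as a bracket list of
literal characters followed by a literal ']', and "a\{2\}" as the literal
text "a{2}", so neither matches "x" or "aa" respectively. With
RE_CHAR_CLASSES | RE_INTERVALS both should match, which is the behavior
etags relied on.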



Re: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs

2025-04-11 Thread Eric Blake
On Fri, Apr 11, 2025 at 04:52:59PM +0300, Vladimir Gorsunov wrote:
>   When GNU Emacs switched to using gnulib for regular expression
>   functionality in the etags program, some features stopped working
>   (please see https://debbugs.gnu.org/cgi/bugreport.cgi?bug=76945 for
>   details). That is because the RE_SYNTAX_EMACS flag combination in gnulib
>   doesn't have the corresponding flags set. This value should be updated
>   to fix etags and to better reflect the set of features GNU Emacs is
>   using at the moment.

> From 76f937ae2eacb3649117e7f4c05819e82a7c42a9 Mon Sep 17 00:00:00 2001
> From: vg 
> Date: Fri, 11 Apr 2025 16:28:29 +0300
> Subject: [PATCH] Update RE_SYNTAX_EMACS to include features used by GNU Emacs
> 
> * lib/regex.h: macro update
> * doc/regex.texi: documentation update
> ---
>  doc/regex.texi | 3 ++-
>  lib/regex.h| 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/regex.texi b/doc/regex.texi
> index cba1e13520..9917a418be 100644
> --- a/doc/regex.texi
> +++ b/doc/regex.texi
> @@ -316,7 +316,8 @@ regular expressions.
>  The predefined syntaxes---taken directly from @file{regex.h}---are:
>  
>  @smallexample
> -#define RE_SYNTAX_EMACS 0
> +# define RE_SYNTAX_EMACS \
> +  (RE_CHAR_CLASSES | RE_INTERVALS)

Hmm.  GNU m4 1.4.19 documents that its regex engine matches emacs -
but that's only because m4 uses syntax 0.  If this change is made in
gnulib, then either the m4 manual needs to be patched to state that it is
similar to emacs except for lacking character classes and intervals,
or we make a non-backwards-compatible change in m4 by actually using
RE_SYNTAX_EMACS instead of 0 for the default syntax.

Since there's already another long thread on how m4 does not match
current emacs regex, and on why enabling intervals would break at least
autoconf 2.72, I'm inclined to update the m4 manual rather than use
RE_SYNTAX_EMACS, whether or not this patch is accepted.

What's more, this patch is incomplete; if you change RE_SYNTAX_EMACS,
then you also need to change this paragraph:

/* The following bits are used to determine the regexp syntax we
   recognize.  The set/not-set meanings are chosen so that Emacs syntax
   remains the value 0.  The bits are given in alphabetical order, and
   the definitions shifted by one from the previous bit; thus, when we
   add or remove a bit, only one other definition need change.  */

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org