Re: [Issue 8 drafts 0001798]: Must posix_getdents remember file offsets across exec?

2024-02-15 Thread Eric Blake via austin-group-l at The Open Group
Adding in Corinna Vinschen, one of the Cygwin developers.  She had
problems trying to post directly on the bug page, so we can use email
replies and summarize the results back to the bug.

On Mon, Jan 22, 2024 at 03:30:20PM +, Austin Group Bug Tracker via 
austin-group-l at The Open Group wrote:
> 
> A NOTE has been added to this issue. 
> == 
> https://austingroupbugs.net/view.php?id=1798 
> == 
> Reported By:                eblake
> Assigned To:
> == 
> Project:                    Issue 8 drafts
> Issue ID:                   1798
> Category:                   System Interfaces
> Type:                       Clarification Requested
> Severity:                   Objection
> Priority:                   normal
> Status:                     New
> Name:                       Eric Blake
> Organization:               Red Hat
> User Reference:             ebb.posix_getdents
> Section:                    XSH posix_getdents
> Page Number:                1567
> Line Number:                52609
> Final Accepted Text:
> == 
> Date Submitted:             2024-01-22 15:13 UTC
> Last Modified:              2024-01-22 15:30 UTC
> == 
> Summary:                    Must posix_getdents remember file offsets across exec?
> == 
> 
> -- 
>  (0006632) eblake (manager) - 2024-01-22 15:30
>  https://austingroupbugs.net/view.php?id=1798#c6632 
> -- 
> Correction - I'm told that the attempted Cygwin implementation also has
> problems after dup(); it is unclear whether the states should be linked
> (reading an entry on one fd, grabbing its offset, then using the other fd
> to read entries, it is unclear whether the second fd starts reading from
> the point where the fd was at the time of dup() or at the shared point
> reached by the first fd, and whether the second fd can safely lseek() to
> the offset read by the first fd).  Easiest would be to state that dup() has
> the same limitations as fork()/exec - namely, that any mid-stream directory
> traversal in either side of the split is unspecified, and the only portable
> thing is to start a new traversal by lseek'ing back to 0 (at which point,
> the implementation no longer has to worry about sharing a half-read DIR*
> across fd copies or processes). 
> 
> Issue History 
> Date Modified    Username  Field           Change
> ==
> 2024-01-22 15:13 eblake    New Issue
> 2024-01-22 15:13 eblake    Name            => Eric Blake
> 2024-01-22 15:13 eblake    Organization    => Red Hat
> 2024-01-22 15:13 eblake    User Reference  => ebb.posix_getdents
> 2024-01-22 15:13 eblake    Section         => XSH posix_getdents
> 2024-01-22 15:13 eblake    Page Number     => 1567
> 2024-01-22 15:13 eblake    Line Number     => 52609
> 2024-01-22 15:30 eblake    Note Added: 0006632
> ==
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: Questions on strftime vs. POSIX

2024-02-12 Thread Eric Blake via austin-group-l at The Open Group
Hello Paul,

The Austin Group meeting today revisited the topic, and came up with
some further thoughts on the matter.  Would you be available to attend
an upcoming Austin Group teleconference meeting (typically Mondays and
Thursdays, at 11am Eastern / 8am Pacific) to speed up the
back-and-forth resolution of issues that may still arise?  For a
typical meeting invite, see:
https://www.mail-archive.com/austin-group-l@opengroup.org/msg12295.html

The leading consensus during today's meeting was that we would very
much like mktime() and strftime("%s") to produce the same value within
a given implementation, while still warning the user that the presence
of time zones means that there are ambiguous cases where different
implementations may come up with different results (one using the
offset present before crossing the shift point; the other normalizing
to the offset present after crossing the shift point).  That means we
probably need to reword the requirements for mktime() at the same time
we pick up your suggested changes for strftime().  The preliminary
thought on how to accomplish that was as follows:

| Suggested changes ...
| 
| On page 1428 line 47966 section mktime(), change:
| 
| shall calculate the time since the Epoch value using either
| the offset in effect before the change or the offset in effect after
| the change.
| 
| to:
| 
| shall calculate the time since the Epoch value using either
| the offset in effect before the change or the offset in effect after
| the change; mktime() may use the value of tm_gmtoff to decide which
| of these two results is the more appropriate to return.
| 
| (and add tm_gmtoff to the list of fields strftime %s can use)

Note that the suggested wording leans towards only using tm_gmtoff to
disambiguate, rather than using it always; but that sounds like it
differs from TZDB where your latest patches appear to use tm_gmtoff
always and ignore tm_isdst and global environment; and we still want
to see if the final wording can allow TZDB's behavior to be considered
compliant.

Another common thought expressed today is that the sequence
strftime("%s",...,gmtime(my_timet)) is, more often than not, likely to
be a bug (you even said so in your point (3)), in that it is not
obvious whether it will produce a value relative to UTC or the user's
current timezone.  Although your earlier email expressed a desire to
let it follow the principle of least surprise for the user, we are
wondering if the standard should call out in APPLICATION USAGE that
such a usage is only portable if the user has also done something like
setenv("TZ", "UTC0", 1) (and possibly also tzset()) in close
proximity; as well as calling out the fact that multi-threaded
applications may need to take even more steps to be careful of global
environment manipulations.  We probably need to amend the standard to
state that gmtime() must set tm_gmtoff to 0 (right now, that
requirement is not there), along with everything else being touched in
the resolution of bug 1797.  Then again, if you are going to set TZ to
UTC0, localtime() and gmtime() should produce the same values (or am I
overlooking a case where they can be different?) - at which point, the
standard can be more explicit in recommending that strftime("%s") is
best used with localtime() rather than gmtime().

Is there any chance of exploiting a flag character in the format
string, such as "%#s" meaning to interpret the struct tm as generated
by gmtime() rather than by the local time zone?  I note that GNU
date(1) has already commandeered %:z, %::z, and %:::z as extensions to
produce various different formattings of %z, as the reason for
considering how such an extension might work.  But at this point, that
would be too much invention to directly include in the resolution for
bug 1797.

If nothing else, the mental contortions required to think about the
best path forward (whether we need to add even more wording to allow
existing implementations to remain compliant while still allowing the
best quality-of-implementation to work in the maximum number of
scenarios) gave us all the more reason to give more weight to the idea
of eventually standardizing tzalloc() and friends (along with a
replacement to strftime() that takes an explicit timezone argument)
for Issue 9; but first we have to get bug 1797 ready for Issue 8 TC1.
https://austingroupbugs.net/view.php?id=1794

For the Austin Group, I will also point out that Paul has recently
been active in a current conversation on the bug-gnulib mailing list,
where developers are trying to come up with a nicer wrapper function
that takes both struct tm and nanoseconds (for a %n specifier), as
well as an indication of local vs UTC timezone, and produces a useful
time format from a single interface.  For example,
https://lists.gnu.org/archive/html/bug-gnulib/2024-02/msg00077.html
https://lists.gnu.org/archive/html/bug-gnulib/2024-02/msg00064.html

Eric Blak

Re: Re: Questions on strftime vs. POSIX

2024-02-07 Thread Eric Blake via austin-group-l at The Open Group
Widening the scope of this conversation, with Paul's permission.

Context for the Open Group readers: per my Action Item from Monday's
meeting, I emailed Paul regarding
https://austingroupbugs.net/bug_view_page.php?bug_id=1797

On Mon, Feb 05, 2024 at 10:51:34AM -0800, Paul Eggert wrote:
> On 2024-02-05 08:15, Eric Blake wrote:
> 
> > Did you consider the effect of the change on applications that
> > populate struct tm directly (and don't currently set tm_gmtoff, except
> > perhaps by zeroing the structure)?
> 
> Yes. Very few apps do that. (I looked for some in the GNU code I help
> maintain, and found none.) They are greatly outnumbered by the applications
> that call localtime/localtime_r/mktime/gmtime/gmtime_r/etc. and pass the
> result to strftime, which is what this bug report is about.
> 
> 
> > Does the latest tzdata code only use tm_gmtoff in the rare cases when
> > it is necessary for disambiguation, or is it always used (overriding
> > the timezone data)?  The bug description implies the former, but the
> > desired action would allow the latter.
> 
> The former. That is, TZDB 2024a strftime looks only at tm_gmtoff, tm_year,
> tm_mon, tm_mday, tm_hour, tm_min, and tm_sec to determine %s, because that's
> all you need.
> 
> The desired action allows either the TZDB behavior, or the glibc behavior
> which if I recall consults tm_gmtoff only when tm_isdst is ambiguous. The
> TZDB behavior is technically better than the glibc behavior for three
> reasons: (1) it removes a multithreading bottleneck, (2) even in a
> single-threaded platform it's faster because mktime is slower than using
> tm_gmtoff, and (3) when user code mistakenly calls gmtime and then strftime
> then %s does what the user expects. The bug report that caused TZDB to
> behave this way was about (3), but (1) and (2) also play a part.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: Recommendation for POSIX ed consideration

2023-12-11 Thread Eric Blake via austin-group-l at The Open Group
Hello Andrew, I'm forwarding your message on to the full Austin Group.

On Sun, Dec 10, 2023 at 11:37:40PM -0500, Andrew L. Moore wrote:
> Hi,
> I am the author of the original GNU ed and maintain an alternative (and I
> might add, much more robust) version at github.com/slewsys/ed.
> 
> One thing that I'd love to see the POSIX committee explore is the exit
> status of ed.  Per the standard:
> 
> EXIT STATUS
> 
> The following exit values shall be returned:
> 
>  0.  Successful completion without any file or command errors.
>  >0.  An error occurred.
> 
> The problem with this behavior is that, in interactive use, it is common to
> make errors, correct them and then write the corrected file.  But by exiting
> with an error, even after successfully writing, this prevents ed from being
> used as the editor for many utilities, which abort when the editor exits with
> a non-zero error code.
> 
> In the version of GNU ed handed over to Antonio, the behavior was that after
> a successful write, the error status is reset to zero.  This had no impact
> on traditional scripting and merely allowed ed to be much more friendly,
> e.g., for writing git commits. Unfortunately, Antonio updated GNU ed at some
> point to follow POSIX, which is sub-optimal.
> -AM
> 
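
To make the complaint concrete, here is a minimal sketch of the kind of caller Andrew describes: a utility that runs an editor and gives up if the editor exits non-zero. The function name edit_or_abort is hypothetical and not taken from any real tool, and "true"/"false" merely stand in for an editor that exits with a zero or non-zero status.

```sh
# Hypothetical caller, not taken from any real tool: run an editor on a
# file and treat a non-zero exit status as "abort", as many utilities do.
edit_or_abort() {
  # $1: editor command (assumed to be a single word), $2: file to edit
  if "$1" "$2"; then
    printf 'accepted: %s\n' "$2"
  else
    status=$?
    printf 'aborted: editor exited with status %s\n' "$status" >&2
    return 1
  fi
}

# "true" and "false" stand in for an editor that exits 0 or non-zero.
edit_or_abort true  /dev/null        # prints "accepted: /dev/null"
edit_or_abort false /dev/null || :   # reports the abort on stderr
```

With a caller like this, an editor that reports an earlier, already-corrected error through its exit status causes the edit to be discarded even though the file was written successfully.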

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: bug#65659: RFC: changing printf(1) behavior on %b

2023-09-01 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Sep 01, 2023 at 07:19:13AM +0200, Phi Debian wrote:
> Well after reading yet another thread regarding libc_printf() I got to
> admit that even %B is crossed out, (Yet already chosen by ksh93)
> 
> The other thread also speaks about libc_printf() documenting %# as
> undefined for things other than  a, A, e, E, f, F, g, and G, yet the same
> thread also talks about a A coming late (citing C99) in the dance, meaning
> what is undefined today becomes defined tomorrow, so %#b is no safer.
>

Caution: The proposal here is for %#s (an alternative string), not %#b
(which C2x wants to be similar to %#x, in that it outputs a '0b'
prefix for all values except bare '0').

Yes, there is a slight risk that C may decide to define %#s.  But as
the Austin Group includes a member of WG14, we are able to advise the
C committee that such an addition is not wise.

> My guess is that printf(1) is now doomed to follow its route, keep its old
> format exception, and then may be implement something like c_printf like
> printf but the format string follow libc semantic, or may be a -C option to
> printf(1)...

Adding an option to printf is also a possibility, if there is
wide-spread implementation practice to standardize.  If someone wants
to implement 'printf -C' right now, that could help feed such a future
standardization.  But it is somewhat orthogonal to the request in this
thread, which is how to allow users to still access the old %b
behavior even if %b gets repurposed in the future; if we can get
multiple implementations to add a %#s alias now, it makes the future
decisions easier (even if it is too late for Issue 8 to add any new
features, or for that matter, to make any normative changes other than
marking %b obsolescent as a way to be able to revisit it in the future
for Issue 9).


> 
> Well, in any case %b cannot change semantics in bash scripts, since it has
> been there for so long; even if it departs from python, perl, libc, it is
> unfortunate but that's the way it is. Nobody wants a semantic change and
> then, on the next router update, to see the whole internet falling apart :-)

How many scripts in the wild actually use %b, though?  And if there
are such scripts, anything we can do to make it easy to do a drop-in
replacement that still preserves the old behavior (such as changing %b
to %#s) is going to be easier to audit than the only other
currently-portable alternative of actually analyzing the string to see
if it uses any octal or \c escapes that have to be re-written to
portably function as a printf format argument.
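
As a rough sketch of how mechanical such a drop-in replacement could be, the filter below rewrites %b to %#s in a script read from standard input. The function name b_to_hash_s is made up for this illustration, and %#s is only the spelling proposed in this thread, so the rewritten script is useful only with a printf that implements it.

```sh
# Rewrite every "%b" in a script read from stdin to the proposed "%#s".
# Assumption: only the bare two-character form is handled; "%%b" (a literal
# percent followed by b) and forms with flags or widths such as "%-10b"
# would need manual review.
b_to_hash_s() {
  sed 's/%b/%#s/g'
}

# Example: convert the body of the my_echo wrapper.
printf '%s\n' 'printf %b\\n "$*"' | b_to_hash_s
```

The point of the sketch is that the audit reduces to a textual search for %b, rather than an analysis of every string that might reach the format.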

POSIX is not mandating %#s at this time, so much as suggesting that if
implementations are willing to implement it now, it will make Issue 9
easier to reason about.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: bug#65659: RFC: changing printf(1) behavior on %b

2023-09-01 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Sep 01, 2023 at 08:59:19AM +0100, Stephane Chazelas wrote:
> 2023-08-31 15:02:22 -0500, Eric Blake via austin-group-l at The Open Group:
> [...]
> > The current POSIX says that %b was added so that on a non-XSI
> > system, you could do:
> > 
> > my_echo() {
> >   printf %b\\n "$*"
> > }
> 
> That is dependent on the current value of $IFS. You'd need:
> 
> xsi_echo() (
>   IFS=' '
>   printf '%b\n' "$*"
> )

Let's read the standard in context (Issue 8 draft 3 page 2793 line 92595):

"
The printf utility can be used portably to emulate any of the traditional
behaviors of the echo utility as follows (assuming that IFS has its
standard value or is unset):
• The historic System V echo and the requirements on XSI implementations
  in this volume of POSIX.1-202x are equivalent to:
    printf "%b\n" "$*"
"

So yes, the standard does mention the requirement to have a sane IFS,
and I failed to include that in my one-off implementation of
my_echo().  Thank you for pointing out a more robust version.
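
A small sketch of the difference, for readers who want to see it: both wrappers are called from a context where IFS has been set to ":" (an arbitrary value chosen for the demonstration). The variable names unpinned and pinned are only for this illustration.

```sh
# "$*" joins the positional parameters with the first character of IFS,
# so a wrapper that does not pin IFS inherits whatever the caller set.
my_echo() {
  printf '%b\n' "$*"
}

# Subshell body (parentheses): the IFS assignment cannot leak to the caller.
xsi_echo() (
  IFS=' '
  printf '%b\n' "$*"
)

# Call both wrappers from a context where IFS is ":".
unpinned=$(IFS=:; my_echo a b c)    # a:b:c
pinned=$(IFS=:; xsi_echo a b c)     # a b c
printf '%s\n' "$unpinned" "$pinned"
```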

> 
> Or the other alternatives listed at
> https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo/65819#65819
> 
> [...]
> > Bash already has shopt -s xpg_echo
> 
> Note that in bash, you need both
> 
> shopt -s xpg_echo
> set -o posix
> 
> To get a XSI echo. Without the latter, options are still
> recognised. You can get a XSI echo without those options with:
> 
> xsi_echo() {
>   local IFS=' ' -
>   set +o posix
>   echo -e "$*\n\c"
> }
> 
> The addition of those \n\c (noop) avoids arguments being treated as
> options if they start with -.

As an extension, Bash (and Coreutils) happen to honor \c always, and
not just for %b.  But POSIX only requires \c handling for %b.

And while Issue 8 has taken steps to allow implementations to support
'echo -e', it is still not standardized behavior; so your xsi_echo()
is bash-specific (which is not necessarily a problem, as long as you
are aware it is not portable).

> [...]
> > The Austin Group also felt that standardizing bash's behavior of %q/%Q
> > for outputting quoted text, while too late for Issue 8, has a good
> > chance of success, even though C says %q is reserved for
> > standardization by C. Our reasoning there is that lots of libc over
> > the years have used %qi as a synonym for %lli, and C would be foolish
> > to burn %q for anything that does not match those semantics at the C
> > language level; which means it will likely never be claimed by C and
> > thus free for use by shell in the way that bash has already done.
> [...]
> 
> Note that %q is from ksh93, not bash and is not portable across
> implementations and with most including bash's gives an output
> that is not safe for reinput in arbitrary locales (as it uses
> $'...' in some cases), not sure  it's a good idea to add it to
> the standard, or at least it should come with fat warnings about
> the risk in using it.

%q is NOT being added to Issue 8, but $'...' is.  Bug 1771 asked if %q
could be added to Issue 8, but it came in past the deadline for
feature requests, so the best we could do is add a FUTURE DIRECTIONS
blurb that mentions the idea.  But since FUTURE DIRECTIONS is
non-normative, we can always change our mind in Issue 9 and delete
that text if it turns out we can't get consensus to standardize some
form of %q/%Q after all.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: bug#65659: RFC: changing printf(1) behavior on %b

2023-08-31 Thread Eric Blake via austin-group-l at The Open Group
On Thu, Aug 31, 2023 at 03:10:58PM -0400, Chet Ramey wrote:
> On 8/31/23 11:35 AM, Eric Blake wrote:
> > In today's Austin Group call, we discussed the fact that printf(1) has
> > mandated behavior for %b (escape sequence processing similar to XSI
> > echo) that will eventually conflict with C2x's desire to introduce %b
> > to printf(3) (to produce 0b000... binary literals).
> > 
> > For POSIX Issue 8, we plan to mark the current semantics of %b in
> > printf(1) as obsolescent (it would continue to work, because Issue 8
> > targets C17 where there is no conflict with C2x), but with a Future
> > Directions note that for Issue 9, we could remove %b entirely, or
> > (more likely) make %b output binary literals just like C.
> 
> I doubt I'd ever remove %b, even in posix mode -- it's already been there
> for 25 years.

But the longer that printf(3) supports "%b" to output binary values,
the more surprised new shell coders will be that printf(1) %b does not
behave the same.  What's more, other languages have already started
using %b for binary output (python, for example), so it is definitely
gaining in mindshare.

That said, I also agree with your desire to keep the functionality in
place.  The current POSIX says that %b was added so that on a non-XSI
system, you could do:

my_echo() {
  printf %b\\n "$*"
}

and then call my_echo everywhere that a script used to depend on XSI
echo (perhaps by 'alias echo=my_echo' with aliases enabled), for a
much quicker portability hack than a tedious search-and-replace of
every echo call that requires manual inspection of its arguments for
translation of any XSI escape sequences into printf format
specifications.  In particular, code like [var='...\c'; echo "$var"]
cannot be changed to use printf by a mere s/echo/printf %s\\n/.  Thus,
when printf was invented and standardized for the shell, the solution
at the time was to create [printf %b\\n "$var"] as a drop-in
replacement for XSI [echo "$var"], even for platforms without XSI
echo.
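
The \c case can be checked directly. The sketch below feeds the same string to %b and to %s; the trailing "|" is only a marker added for the demonstration, so that a suppressed newline is visible after command substitution strips trailing newlines.

```sh
var='no newline\c'

# %b interprets the escapes in its argument: \c discards everything that
# would follow, including the \n from the format string.
with_b=$(printf '%b\n' "$var"; printf '|')

# %s copies the argument verbatim, so the backslash and "c" are printed
# and the trailing newline is kept.
with_s=$(printf '%s\n' "$var"; printf '|')

printf '[%s]\n' "$with_b" "$with_s"
```

Here with_b holds "no newline|" on a single line, while with_s holds the literal text "no newline\c", a newline, and then "|".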

Nowadays, I personally have not seen very many scripts like this in
the wild (for example, autoconf scripts prefer to directly use printf,
rather than trying to shoe-horn behavior into echo).  But assuming
such legacy scripts still exist, it is still much easier to rewrite
just the my_echo wrapper to now use %#s\\n instead of %b\\n, than it
would be to find every callsite of my_echo.

Bash already has shopt -s xpg_echo; I could easily see this being a
case where you toggle between the old or new behavior of %b (while
keeping %#s always at the old behavior) by either this or some other
shopt in bash, so that newer script writers that want binary output
for %b can do so with one setting, while scripts that must continue to
run under old semantics can likewise do so.

> 
> > But that
> > raises the question of whether the escape-sequence processing
> > semantics of %b should still remain available under the standard,
> > under some other spelling, since relying on XSI echo is still not
> > portable.
> > 
> > One of the observations made in the meeting was that currently, both
> > the POSIX spec for printf(1) as seen at [1], and the POSIX and C
> > standard (including the upcoming C2x standard) for printf(3) as seen
> > at [3] state that both the ' and # flag modifiers are currently
> > undefined when applied to %s.
> 
> Neither one is a very good choice, but `#' is the better one. It at least
> has a passing resemblance to the desired functionality.

Indeed, that's what the Austin Group settled on today after I first
wrote my initial email, and what I wrote up in a patch to GNU
Coreutils (https://debbugs.gnu.org/65659)

> 
> Why not standardize another character, like %B? I suppose I'll have to look
> at the etherpad for the discussion. I think that came up on the mailing
> list, but I can't remember the details.

Yes, https://austingroupbugs.net/view.php?id=1771 has a good
discussion of the various ideas.

%B is out for the same reason as %b: although the current C2x draft
wording says that % followed by an uppercase letter is reserved for
implementation use, other than [AEFGX] which already have a history of
use by C (as it was, when C99 added %A, that caused problems for some
folks), it goes on to
_highly_ encourage any implementation that adds %b for "0b0" binary
output also add %B for "0B0" binary output (to match the x/X
dichotomy).  Burning %B to retain the old behavior while repurposing
%b to output lower-case binary values is thus a non-starter, while
burning %#s (which C says is undefined) felt nicer.

The Austin Group also felt that standardizing bash's behavior of %q/%Q
for outputting quoted text, while too late for Issue 8, has a good
chance of success, even though C says %q is reserved for
standardization by C. Our reasoning there is tha

RFC: changing printf(1) behavior on %b

2023-08-31 Thread Eric Blake via austin-group-l at The Open Group
In today's Austin Group call, we discussed the fact that printf(1) has
mandated behavior for %b (escape sequence processing similar to XSI
echo) that will eventually conflict with C2x's desire to introduce %b
to printf(3) (to produce 0b000... binary literals).

For POSIX Issue 8, we plan to mark the current semantics of %b in
printf(1) as obsolescent (it would continue to work, because Issue 8
targets C17 where there is no conflict with C2x), but with a Future
Directions note that for Issue 9, we could remove %b entirely, or
(more likely) make %b output binary literals just like C.  But that
raises the question of whether the escape-sequence processing
semantics of %b should still remain available under the standard,
under some other spelling, since relying on XSI echo is still not
portable.

One of the observations made in the meeting was that currently, both
the POSIX spec for printf(1) as seen at [1], and the POSIX and C
standard (including the upcoming C2x standard) for printf(3) as seen
at [3] state that both the ' and # flag modifiers are currently
undefined when applied to %s.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html
"The format operand shall be used as the format string described in
XBD File Format Notation[2] with the following exceptions:..."

[2] 
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap05.html#tag_05
"The flag characters and their meanings are: ...
# The value shall be converted to an alternative form. For c, d, i, u,
  and s conversion specifiers, the behavior is undefined.
[and no mention of ']"

[3] https://pubs.opengroup.org/onlinepubs/9699919799/functions/printf.html
"The flag characters and their meanings are:
' [CX] [Option Start] (The <apostrophe>.) The integer portion of the
  result of a decimal conversion ( %i, %d, %u, %f, %F, %g, or %G )
  shall be formatted with thousands' grouping characters. For other
  conversions the behavior is undefined. The non-monetary grouping
  character is used. [Option End]
...
# Specifies that the value is to be converted to an alternative
  form. For o conversion, it shall increase the precision, if and only
  if necessary, to force the first digit of the result to be a zero
  (if the value and precision are both 0, a single 0 is printed). For
  x or X conversion specifiers, a non-zero result shall have 0x (or
  0X) prefixed to it. For a, A, e, E, f, F, g, and G conversion
  specifiers, the result shall always contain a radix character, even
  if no digits follow the radix character. Without this flag, a radix
  character appears in the result of these conversions only if a digit
  follows it. For g and G conversion specifiers, trailing zeros shall
  not be removed from the result as they normally are. For other
  conversion specifiers, the behavior is undefined."

Thus, it appears that both %#s and %'s are available for use for
future standardization.  Typing-wise, %#s as a synonym for %b is
probably going to be easier (less shell escaping needed).  Is there
any interest in a patch to coreutils or bash that would add such a
synonym, to make it easier to leave that functionality in place for
POSIX Issue 9 even when %b is repurposed to align with C2x?
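
For contrast, here is a short sketch of what the # flag does in printf(1) for conversions where the quoted text defines it (o and x). It deliberately does not exercise %#s, whose behavior is currently undefined and is exactly what the proposal would claim. The variable names are only for this illustration.

```sh
# The "#" flag is only defined for some conversions; these three calls
# stay within the defined cases.
hex=$(printf '%#x' 255)     # 0xff
octal=$(printf '%#o' 8)     # 010
zero=$(printf '%#x' 0)      # 0 (no prefix for a zero result)
printf '%s %s %s\n' "$hex" "$octal" "$zero"
```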

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: encoding question

2023-07-18 Thread Eric Blake via austin-group-l at The Open Group
On Sat, Jul 15, 2023 at 10:41:49PM +, Thorsten Glaser via austin-group-l at 
The Open Group wrote:
> Hi,
> 
> I get that the POSIX locale must be a single-byte character locale
> where all 256 octets are characters. I’ve got a question about the
> wide character representation.
> 
> Assuming my POSIX locale uses ASCII as encoding, I’ve got the whole
> portable character set (and then some) in the first 128 codepoints,
> which have the ASCII code as both octet SBCS value and wchar_t value.
> In this scenario, is it permissible to map the other 128 codepoints
> “high” i.e. to wchar_t values > 0x0100?

You're not the first to ask this question.  Here's a link to a
proposed patch to glibc on the same topic just this month, after
noting that musl has already dealt with it:

https://sourceware.org/pipermail/libc-alpha/2023-July/149588.html
https://sourceware.org/pipermail/libc-alpha/2023-July/150021.html
https://www.openwall.com/lists/musl/2022/11/10/2

The conclusion in those links appears to be that it is compliant to
have the 8-bit characters map to wchar_t codepoints that are not valid
Unicode characters, but which are distinct enough to preserve all
other properties needed to treat the POSIX locale as a single-byte
locale with 256 "characters" and proper collation sequence without
encoding errors.  Whether the mapping is to the 0xdcXX or 0xdfXX range
of reserved codepoints in Unicode is a matter of implementation
choice; both choices exist in implementations already out there.

> 
> I’m reading the standard as yes, but not asking already landed me
> in trouble in the past so I’d rather…

That's a wise course of action.  And while maybe the standard could
make this easier, the fact that there are already two commonly chosen
ranges already in play is not going to make it easy to mandate a
specific mapping.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [PATCH] sockaddr.3type: Document that sockaddr_storage is the API to be used

2023-04-21 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Apr 21, 2023 at 05:00:14PM +0200, Alejandro Colomar wrote:
> > 
> > The wording I see in <https://austingroupbugs.net/view.php?id=1641#c6262>
> > doesn't seem to cover the case of aliasing a sockaddr_storage as a
> > protocol-specific address for setting other members.
> > 
> > Aliasing rules don't allow one to declare an object of type
> > sockaddr_storage and then fill the structure as if it were another
> > structure, even if alignment and size are correct.  We would need
> > some wording that says something like:
> > 
> > When a pointer to a sockaddr_storage structure is first aliased as a
> > pointer to a protocol-specific address structure, the effective type
> > of the object will be set to the protocol-specific structure.

I'll add that as a comment to the Austin Group page; it seems like a
reasonable statement of intent (POSIX already says that struct
sockaddr_storage is sufficiently sized and aligned; all that remains
is for the compiler to be aware that we intend to use a
more-appropriate effective type once we have the storage allocated).

> > 
> > This is similar to what happens when malloc(3) is assigned to a
> > non-character type.  That's a big hammer, but it does the job.  Maybe
> > we would need some looser language?  I CCd GCC, in case they have
> > concerns about this wording.
> > 
> > Cheers,
> > Alex
> > 
> >>
> >> I quite like this way of putting it.  It subsumes both what I wrote and 
> >> the related potential headache with deciding whether the sa_family_t 
> >> field is considered an object or just a range of bytes within a larger 
> >> object.
> >>
> >> zw
> > 
> 
> For the man pages, I've rewritten it to the following:
> 
> 
> $ git diff
> diff --git a/man3type/sockaddr.3type b/man3type/sockaddr.3type
> index 2fdf56c59..e610aa0f5 100644
> --- a/man3type/sockaddr.3type
> +++ b/man3type/sockaddr.3type
> @@ -117,6 +117,14 @@ .SH HISTORY
>  was invented by POSIX.
>  See also
>  .BR accept (2).
> +.PP
> +These structures were invented before modern ISO C strict-aliasing rules.
> +If aliasing rules are applied strictly,
> +these structures would be impossible to use

Maybe "extremely difficult" instead of "impossible" to use (if I
understand this thread correctly, it is possible to memcpy() from one
struct into different storage of a different effective type where the
memcpy()'s intermediate aliasing through char* avoids the UB).

> +without invoking Undefined Behavior (UB).
> +POSIX Issue 8 will fix this by requiring that implementations
> +make sure that these structures
> +can be safely used as they were designed.
>  .SH NOTES
>  .I socklen_t
>  is also defined in
> 
> 
> I guess this is simple enough that it should work as documentation.

It seems fine from my perspective.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [PATCH] sockaddr.3type: Document that sockaddr_storage is the API to be used

2023-04-06 Thread Eric Blake via austin-group-l at The Open Group
On Thu, Apr 06, 2023 at 02:05:15PM -0400, Zack Weinberg wrote:
> On Thu, Apr 6, 2023, at 12:31 PM, Alejandro Colomar via Libc-alpha wrote:
> > On 4/6/23 18:24, Eric Blake wrote:
> >> here's the updated wording that the Austin Group tried today (and we
> >> plan on starting a 30-day interpretation feedback window if there are
> >> still adjustments to be made to the POSIX wording):
> >>
> >> https://austingroupbugs.net/view.php?id=1641#c6255
> >
> > Thanks!  That wording (both paragraphs) LGTM.
> 
> If I could suggest an additional change, the focus on aliasing
> _diagnostics_ rather misses the point IMHO.  We don't just want the
> compiler to _not complain_ about accesses to sa_family_t, we want it to
> treat the accesses as _legitimate_.  So, instead of
> 
> # Additionally, the structures shall be defined in such a way that
> # these casts do not cause the compiler to produce diagnostics about
> # aliasing issues in accessing the sa_family_t member of these
> # structures when compiling conforming application (xref to XBD section
> # 2.2) source files.
> 
> may I suggest wording along the lines of
> 
> # Additionally, the structures shall be defined in such a way that
> # the compiler treats an access to the stored value of the sa_family_t
> # member of any of these structures, via an lvalue expression whose type
> # involves any other one of these structures, as permissible, despite the
> # more restrictive rules listed in ISO C section 6.5p7.

I like it as an improvement; I've added your suggestion to the POSIX
bug report as one of the comments received during the 30-day
interpretation window, to see what the other standards developers
think.

Since Issue 7 is tied to C99, and Issue 8 will be tied to C17, both of
which use the same section number despite being different editions of
the C standard, being that specific may work.  Or, we might try
something focusing more on wording instead of document location, as
in:

Additionally, the structures shall be defined in such a way that the
compiler treats an access to the stored value of the sa_family_t
member of any of these structures, via an lvalue expression whose type
involves any other one of these structures, as permissible even if the
types involved would not otherwise be deemed compatible with the
effective type of the object ultimately being accessed.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Austin Group questions on iconv()

2023-03-09 Thread Eric Blake via austin-group-l at The Open Group
In today's Austin Group meeting, the folks discussing POSIX had a
question for Bruno and/or anyone else with an idea on how the
standards should approach a difference in behavior between Solaris and
GNU iconv() implementations.

For context, today's meeting minutes:
https://posix.rhansen.org/p/2023-03-09 around line 1635

and the bugs leading to the question:

https://austingroupbugs.net/view.php?id=1635
 "0001635: iconv: please be more explicit in input-not-convertible case"
 still open - iconv() resulting in EILSEQ not because of input
 encoding error but because of output being unable to encode the
 transliteration

https://austingroupbugs.net/view.php?id=1007
 "0001007: iconv function not allowed to fail to convert valid sequences"
 resolved at https://austingroupbugs.net/view.php?id=1007#c3330,
 standardizing the //IGNORE, //TRANSLIT, and //NON_IDENTICAL_DISCARD
 modifiers

It seems that bug 1635 is saying that the Solaris implementation
provides a conversion that application writers can use to get reliable
output but does not provide some desired features, and the standard
should change to acknowledge that the GNU implementation provides some
of those desired features.  However, the GNU implementation includes
some ambiguities that make it unreliable.  It seems to ask us to
change the standard to allow a modified version of the GNU iconv()
function that could be reliably interpreted by an application writer.
For example, overloading EILSEQ to mean that there was an invalid
character in the input stream or that there was no transliteration
available in the output codeset to convert that input character makes
it impossible for an application to determine which of those two
problems caused iconv() to fail.
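
To make the ambiguity concrete, here is a rough sketch of what an
application sees. The helper name utf8_to_ascii_errno is hypothetical,
and the codeset names "ASCII" and "UTF-8" are an assumption, since
codeset names are implementation-defined:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert len bytes of UTF-8 to ASCII and return 0
 * on success, or the errno value reported by iconv_open()/iconv(). */
static int utf8_to_ascii_errno(const char *src, size_t len)
{
    char out[64];
    char *in = (char *)src;   /* iconv() takes char **, input is not modified */
    char *outp = out;
    size_t inleft = len;
    size_t outleft = sizeof out;
    int err = 0;
    iconv_t cd;

    assert(src != NULL);
    /* Assumption: the implementation accepts these codeset names. */
    cd = iconv_open("ASCII", "UTF-8");
    if (cd == (iconv_t)-1)
        return errno;
    if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1)
        err = errno;
    iconv_close(cd);
    return err;
}
```

An invalid input byte such as "\xff" is reported as EILSEQ.  On an
implementation that behaves as bug 1635 describes, a valid input such
as "\xc3\xa9" (e-acute) that has no ASCII representation would come
back with the same EILSEQ, and the caller has only that errno value to
go on.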

Can we get an explanation on how an application writer is supposed to
write code to reliably use the iconv() in GNU libc, given the above
example?  Can we get help in identifying exactly what changes need to
be made to POSIX (after bugid:1007 has been integrated) to allow GNU
behavior and get reliable results without breaking applications that
currently work with the Solaris iconv() interface?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2008)/Issue 7 0000561]: NUL-termination of sun_path in Unix sockets

2022-11-30 Thread Eric Blake via austin-group-l at The Open Group
On Wed, Nov 30, 2022 at 08:54:03AM -0600, Eric Blake via austin-group-l at The 
Open Group wrote:
> >  ...
> >  |https://austingroupbugs.net/view.php?id=561 

> 
> First, I chose that wording because 'sizeof(struct
> sockaddr_un.sun_path)' doesn't compile.  You are right that 'sizeof
> NAME.sun_path' does compile, if NAME is an expression of type struct
> sockaddr_un, but the sentence becomes longer to introduce some object
> named NAME of the correct type just to get to the shorter sizeof
> expression.  However, we can make that edit if it makes sense.

Having written that, I did test that 'sizeof(((struct
sockaddr_un*)0)->sun_path)' compiles with gcc, although I'm less
certain of whether the C standard permits that (or even if that
permission has changed over time) - the expression argument to sizeof
is unevaluated, which counters the argument that you can't normally
evaluate a dereference of a NULL pointer.
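
For reference, a small sketch of the two spellings that do compile.
The macro name SUN_PATH_SIZE and the function name are made up for
this example:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/un.h>

/* sizeof does not evaluate its operand here, so the null pointer is
 * never actually dereferenced. */
#define SUN_PATH_SIZE (sizeof(((struct sockaddr_un *)0)->sun_path))

/* Same value, obtained through a named object of the right type. */
static size_t sun_path_size_via_object(void)
{
    struct sockaddr_un name;

    assert(sizeof name.sun_path == SUN_PATH_SIZE);
    return sizeof name.sun_path;
}
```

The macro form avoids having to introduce an object at all, at the
cost of being harder to read in the standard's prose.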

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2008)/Issue 7 0000561]: NUL-termination of sun_path in Unix sockets

2022-11-30 Thread Eric Blake via austin-group-l at The Open Group
On Mon, Nov 28, 2022 at 07:30:36PM +0100, Steffen Nurpmeso via austin-group-l 
at The Open Group wrote:
> Austin Group Bug Tracker wrote in
>  :
>  ...
>  |https://austingroupbugs.net/view.php?id=561 
>  ...
>  |-- 
>  | (0006085) geoffclare (manager) - 2022-11-28 16:24
>  | https://austingroupbugs.net/view.php?id=561#c6085 
>  |-- 
>  ...
>  |char sun_path[size]   Socket pathname
>  |storage.
>  ...
>  |[.] However, because sun_path is required to be the
>  |last member of the struct, an application can deduce the size by using
>  |sizeof(struct sockaddr_un) - offsetof(struct sockaddr_un,
>  |sun_path).
> 
> I am glued to old habits, but given it is the last field and of
> a known fixed size sizeof(NAME.sun_path) should be all that is
> necessary.  (It definitely is in practice.)
> (And all this different to SUN_LEN(), of course.)

Two comments in response:

First, I chose that wording because 'sizeof(struct
sockaddr_un.sun_path)' doesn't compile.  You are right that 'sizeof
NAME.sun_path' does compile, if NAME is an expression of type struct
sockaddr_un, but the sentence becomes longer to introduce some object
named NAME of the correct type just to get to the shorter sizeof
expression.  However, we can make that edit if it makes sense.

Second, given alignment issues, a choice of an odd size coupled with
other members that require even alignment could permit an
implementation where sizeof(struct sockaddr_un) > offsetof(struct
sockaddr_un, sun_path) + sizeof(NAME.sun_path) due to padding bytes
added for alignment reasons.  I don't know of any such implementations
in practice (the choice of 92, 104, and 108 as the most common sizes
tends to be so that the overall struct sockaddr_un has a size of 128
bytes, which is a nice power-of-two boundary).  Then again,
intentionally forcing struct sockaddr_un to have a padding byte after
sun_path might be an implementation's way of guaranteeing that it can
handle a NUL byte even if the application didn't pass one in.
Therefore, do we need to modify the wording in this proposal to ensure
that struct sockaddr_un is not allowed to have padding bytes after
sun_path to match existing practice?
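
A quick way to check a given implementation, as a sketch (both function
names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/un.h>

/* Size deduced by the formula in the proposed wording. */
static size_t sun_path_deduced_size(void)
{
    return sizeof(struct sockaddr_un) - offsetof(struct sockaddr_un, sun_path);
}

/* Bytes between the end of sun_path and the end of the struct; this is
 * the padding only if sun_path really is the last member. */
static size_t sun_path_trailing_padding(void)
{
    size_t member = sizeof(((struct sockaddr_un *)0)->sun_path);
    size_t deduced = sun_path_deduced_size();

    /* A member cannot extend past the end of its struct. */
    assert(deduced >= member);
    return deduced - member;
}
```

If sun_path_trailing_padding() returns nonzero anywhere, the two ways
of computing the size disagree on that system, which is the case the
wording would need to address.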

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2016/18)/Issue7+TC2 0001457]: Add readlink(1) utility

2022-07-22 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Jul 22, 2022 at 05:04:09PM +0100, Jonathan Wakely wrote:
> On Fri, 22 Jul 2022 at 15:53, Robert Elz via austin-group-l at The
> Open Group  wrote:
> > Aside from that possibility the only reason would seem to be the same
> > as why echo (real ones) have -n (and trashy ones have \c) and why
> > printf(1) needs a \n to print one ... there are times that it is useful
> > to write a partial line to stdout (or wherever) and there's no reason
> > that the output of readlink could not be intended to be a part of such
> > a gradually constructed output line.
> 
> But then shouldn't *every* command that prints output have a -n option?
> 
> If you need to include the output of readlink in gradually constructed
> output you can do what you have to do with other commands:
> 
> printf '%s' "$(readlink foo)"

That strips trailing newlines that may have been important.  The link
contents $'abc' and $'abc\n' are indistinguishable under your approach of
a path through $() and printf.  If you are going to output a
constructed filename to stdout, you really DO want:

readlink -n foo && echo /newfile

to produce the output "link/content/newfile" when foo contains
'link/content', and still handle the case where foo's content is
instead something with a trailing newline.

> 
> The fact that echo and printf have that feature means you don't need
> it everywhere.

You don't need it for utilities that are seldom used in generating
partial file names; but for programs like dirname and readlink,
providing a simpler way to use the utility in the context of building
up a larger file name without losing intermediate trailing newlines
that would be eaten by $() is enough of a worry that adding things
like -n to make it more useful was worthwhile to the implementors.
I'm aware that 'dirname -n' is not common implementation practice, but
since 'readlink -n' does appear to be, there's no harm in
standardizing it that way.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2016/18)/Issue7+TC2 0001457]: Add readlink(1) utility

2022-07-22 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Jul 22, 2022 at 09:26:45AM +0200, Quentin Rameau via austin-group-l at 
The Open Group wrote:
> Hello,
> 
> > == 
> > https://austingroupbugs.net/view.php?id=1457 
> > == 
> 
> > == 
> > Summary:Add readlink(1) utility
> > == 
> 
> > -n    Do not output a trailing <newline>
> > character.
> 
> Out of curiosity, what's a use-case for that?

Good question.  My initial thought was that the construct:

  var=$(readlink -- "$name")

will NOT assign var to the correct contents if $name is a symlink that
resolves to a string containing trailing newlines, as $() would strip
not only the newline added by readlink, but also the newlines from the
link contents.  But using:

  var=$(readlink -n -- "$name")

will not fare any better; it will also strip trailing newlines from
the link content.  The only reliable way to accurately capture the
contents of a symlink in a shell variable is to do something like:

  tmp=$(readlink -n -- "$name"; printf .)
  var=${tmp%.}

at which point the addition of -n doesn't really help, because you
could also do:

  tmp=$(readlink -- "$name"; printf .)
  var=${tmp%?.}

with fewer characters typed.

So the only actual answer I can come up with is "existing practice in
readlink implementations in the wild", where we'd have to ask the
program designers why they thought -n was useful.

[If readlink is implemented as a shell builtin, then you could have an
extension where:

  readlink -v var -n -- "$name"

assigns $var to the full symlink contents, without any extra or
stripped newlines, but such an extension is not what we are proposing
to standardize]

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: Can struct sockaddr_un.sun_path be a flexible array member?

2022-07-20 Thread Eric Blake via austin-group-l at The Open Group
On Sun, Jul 17, 2022 at 03:46:52PM -0700, Nick Stoughton via austin-group-l at 
The Open Group wrote:
> Note that a flexible array member is not the same thing as a variable
> length array, and although both entered the standard in C99, previous
> versions allowed the FAM to be specified as an array of length 0.
> 
> The C standard notes that:
> > In most situations, the flexible array member is ignored. In particular,
> the size of the structure is as if the flexible array member were omitted
> ...
> and "sizeof" does just that (omits the flexible array member).
> 
> The normative text does not seem to preclude the use of a flexible array
> member but does not specify any mechanism to obtain the size if it were so.
> I believe that it is a bug in the standard that it is not made clearer that
> the implementation should define the size somehow. I know of no
> implementation that uses a flexible array here. Please feel free to submit
> a bug to austingroupbugs.net with this.

Or better yet, help with amending the existing bug to propose the
desired wording changes:

https://www.austingroupbugs.net/view.php?id=561

Based on an earlier meeting, our current thoughts are:

- Add requirement that sun_path be last member of struct sockaddr_un,
and that it have a constant (although unspecified) size rather than
being an open array

- Add application usage to functions dealing with sockname to
recommend passing memory larger than sizeof(struct sockaddr_un),
preinitialized to 0, when it is desired to ensure NUL termination

- Leave SUN_LEN out of the standard; we don't want variable-length
sun_path
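
As a sketch of what the second bullet's application usage might look
like in practice (the helper name unix_sockname is hypothetical, and
this assumes sun_path is the last member as the first bullet would
require):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: copy the pathname bound to fd into out as a
 * NUL-terminated string.  Returns 0 on success, -1 on failure. */
static int unix_sockname(int fd, char *out, size_t outlen)
{
    /* One byte more than struct sockaddr_un, all zero, so that a path
     * filling sun_path completely is still followed by a NUL byte. */
    union {
        struct sockaddr_un un;
        char raw[sizeof(struct sockaddr_un) + 1];
    } buf;
    socklen_t len = sizeof buf.un;
    const char *path = buf.raw + offsetof(struct sockaddr_un, sun_path);

    assert(out != NULL && outlen > 0);
    memset(&buf, 0, sizeof buf);
    if (getsockname(fd, (struct sockaddr *)&buf.un, &len) == -1)
        return -1;
    if (buf.un.sun_family != AF_UNIX || strlen(path) >= outlen)
        return -1;
    strcpy(out, path);
    return 0;
}
```

The string is read through the raw char view so that strlen() never
walks past the end of the sun_path array itself.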

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: Latest on POSIX efforts to standardize gettext

2022-05-09 Thread Eric Blake via austin-group-l at The Open Group
On Thu, May 05, 2022 at 09:31:41AM -0500, Eric Blake via austin-group-l at The 
Open Group wrote:
> Hello GNU and Illumos folks,
> 
> The Austin Group (those in charge of the POSIX specification) have
> been working on a draft to incorporate the gettext(3) family of
> functions and related gettext(1) utilities into the next revision of
> POSIX (per https://austingroupbugs.net/view.php?id=1122).  After
> several months of near-weekly conference calls, the latest draft of
> the work has finally reached the point where it is ready for more
> thorough analysis by a wider group of readers.  You can view the
> current state of the draft here:
> 
> https://posix.rhansen.org/p/gettext_draft

Another question came up today (line 1172 in the draft at the time I
wrote this email).  Given the following test file test.c:

#include <stdio.h>
#include <libintl.h>
int main(){
  printf("%s\n",dgettext("foobar","test"));
}

Running "xgettext test.c", on Solaris, the resulting .po file is
called "foobar.po" and contains the msgid "test". Running it on GNU,
the resulting .po file is called "messages.po" and there is no
indication that the msgid belongs to "foobar". According to the Li18nux
specification, the Solaris behavior is intended. Why does GNU xgettext
deviate?

Knowing whether this is considered a bug that future GNU xgettext will
fix, vs. intentional behavior that the standard should purposefully
not constrain, can impact what wording is chosen for the standard here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Latest on POSIX efforts to standardize gettext

2022-05-05 Thread Eric Blake via austin-group-l at The Open Group
Hello GNU and Illumos folks,

The Austin Group (those in charge of the POSIX specification) have
been working on a draft to incorporate the gettext(3) family of
functions and related gettext(1) utilities into the next revision of
POSIX (per https://austingroupbugs.net/view.php?id=1122).  After
several months of near-weekly conference calls, the latest draft of
the work has finally reached the point where it is ready for more
thorough analysis by a wider group of readers.  You can view the
current state of the draft here:

https://posix.rhansen.org/p/gettext_draft

In particular, this draft has an action item to me to reach out to you
on the following question (currently found at line 1138 of that
document, or search for "A.I."):

In the msgfmt(1) utility, there is currently a difference between GNU
and Illumos implementations on detecting duplicate msgid strings, and
which command line switch(es) make detection of duplicates possible.
The question is whether GNU msgfmt would be willing to have the current
-c option (--check) include a mode for erroring out on duplicate msgid
strings, or even to add a new command line option (-n appears to be
available, for a mnemonic of 'no dupes') that makes duplicate
detection available without requiring -c.

In addition to answering that question, any review of the rest of the
proposed wording (particularly anything that is still colored and thus
represents edits since the last time we asked for review) is still
appreciated.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Issue 8 drafts 0001556]: clarify meaning of \n used in a bracket expression in a sed context address or s-command

2022-04-25 Thread Eric Blake via austin-group-l at The Open Group
Adding bug-...@gnu.org into this conversation.

On Mon, Apr 25, 2022 at 02:50:22AM +0200, Christoph Anton Mitterer via 
austin-group-l at The Open Group wrote:
> Hey.
> 
> Geoff, I haven't had time yet to look at your updated proposal of
> #1550, not sure whether I manage to do it this night or in the next
> days.
> But I'll definitely reply, so please be a bit more patient. :-)
> 
> 
> However, one thing came to my mind again, which I think needs further
> discussion...
> 
> 
> 
> The current "solution" to a number of previous problems is:
> 
> Inside a bracket expression there cannot be any escape sequences.
> Therefore, there cannot be any \n (in the sense of <newline>) nor any
> \c (in the sense of "un-delimitering" the delimiter character c).
> 
> 
> While this is per se perfectly valid (and solves numerous issues), it
> has one problem:
> 
> (at least) GNU sed breaks it already!
> 
> 
> 
> As you noted yourself in
> https://www.austingroupbugs.net/view.php?id=1556#c5621
> 
> it requires POSIXLY_CORRECT=1 to work as it should.
> 
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> a\b
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> X
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> anb
> $ export POSIXLY_CORRECT=1
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> X
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> a
> b
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> X
> $ 
> 
> 
> NOT so for GNU's extension of '\s':
> '\s'
>  Matches whitespace characters (spaces and tabs).  Newlines
>  embedded in the pattern/hold spaces will also match...
> (and I assume neither for any similar such extensions):
> 
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $ export POSIXLY_CORRECT=1
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $
> 
> 
> It also works as expected for escaped delimiter characters:
> $ printf 'aDb\n' | sed 'sDa[\D]bDXD'
> X
> $ printf 'a\\b\n' | sed 'sDa[\D]bDXD'
> X
> 
> even when the delimiter char has also special meaning when escaped (as
> with '\s'):
> $ printf 'asb\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a\\b\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a b\n' | sed 'ssa[\s]bsXs'
> a b
> 
> 
> (all the above with GNU sed 4.8).
> 
> 
> So the only problematic case seems to be '\n'.
> 
> 
> 
> I don't want to step on anyone's toes... but GNU sed is probably one of
> the (if not the) major implementation of sed, isn't it?
> 
> 
> And regardless of POSIXLY_CORRECT, the standard describes now a
> behaviour (namely that the bracket expression [\n] is the literal
> characters '\' or 'n' and *not* <newline>)... which is not shared by a
> major implementation, at least not with its default settings.
> 
> Anyone who reads the standard would assume that [\n] is not a
> <newline>.
> And of course we could just say "well your implementation is not
> compliant" or "look at its documentation, where it says about
> POSIXLY_CORRECT" ... but that doesn't seem so good to me.
> 
> Usually, implementations extend POSIX rather gracefully, but this is a
> more serious deviation.
> 
> 
> I mean should we just leave it at that?
> 
> Or should we add some hint, e.g. indicating that portable applications
> should not use '\n' but rather 'n\' ... or perhaps even generally place
> '\' last in the bracket expression?
> 
> 
> The best would of course be to get GNU to change its behaviour, though I
> have no idea how likely that is ;-)
> 
> I had tried to reach out to GNU and BusyBox sed maintainers before, and
> while I got replies from BusyBox' I couldn't get in touch with GNU's.
> 
> Is there anyone who's in contact with these people?

The GNU sed developers can be reached at bug-...@gnu.org (per the
output of 'sed --help', and as done in this email).

So if I'm restating your complaint correctly, you are worried that GNU
sed's non-POSIX behavior (what you get by default when POSIXLY_CORRECT
is not set) treats the four-byte sequence '[\n]' in an s-command regex
as a bracket expression for the single character of a literal newline
(that is, interpreting \n as an escape sequence even though it is
inside a bracket expression), instead of as a bracket expression for
either of a literal backslash or literal n; but concur that its
behavior when being POSIX-compliant matches the POSIX rules.

POSIX can't control what GNU sed does when in non-POSIX mode

Re: how do to cmd subst with trailing newlines portable (was: does POSIX mandate whether the output…)

2022-02-08 Thread Eric Blake via austin-group-l at The Open Group
 filtering out any who really are considered as that.
> 
> That gave quite some matches:
> BRF.gz: /x2e BRAILLE PATTERN DOTS-46
> BRF.gz: /x2f BRAILLE PATTERN DOTS-34
> EBCDIC-AT-DE-A.gz: /x2e ACKNOWLEDGE (ACK)
> EBCDIC-AT-DE-A.gz: /x2f BELL (BEL)

Charmaps are useful to iconv for converting file contents between
encodings, which can include more encodings than are permitted in locales.

> IBM918.gz: /x2f BELL (BEL)
> INIS-CYRILLIC.gz: /x2e RIGHTWARDS ARROW
> INIS-CYRILLIC.gz: /x2f INTEGRAL
> ISO_10646.gz: /x01/x2E LATIN CAPITAL LETTER I WITH OGONEK
> ISO_10646.gz: /x01/x2F LATIN SMALL LETTER I WITH OGONEK
> ISO_10646.gz: /x04/x2E CYRILLIC CAPITAL LETTER YU
> ISO_10646.gz: /x04/x2F CYRILLIC CAPITAL LETTER YA
> ISO_10646.gz: /x06/x2E ARABIC LETTER KHAH
> ISO_10646.gz: /x06/x2F ARABIC LETTER DAL
> ISO_10646.gz: /x1E/x2E LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
> ISO_10646.gz: /x1E/x2F LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
> ISO_10646.gz: /x22/x2E CONTOUR INTEGRAL
> ISO_10646.gz: /x25/x2E BOX DRAWINGS RIGHT HEAVY AND LEFT DOWN LIGHT
> ISO_10646.gz: /x25/x2F BOX DRAWINGS DOWN LIGHT AND HORIZONTAL HEAVY
> ISO_11548-1.gz: /x2e BRAILLE PATTERN DOTS-2346
> ISO_11548-1.gz: /x2f BRAILLE PATTERN DOTS-12346
> JIS_C6220-1969-JP.gz: /x2E KATAKANA LETTER SMALL YO
> JIS_C6220-1969-JP.gz: /x2F KATAKANA LETTER SMALL TU
> 
> Since all these (well except perhaps ISO_10646) use 0x2E and 0x2F for
> other characters than . and /  ... doesn't that already mean that
> they're invalid with respect to POSIX?

Not quite.  You didn't ALSO check whether those charmaps define
 as something that overlaps with a multibyte character.  But
you are right that there are some charmaps which iconv can support but
which cannot be used as a locale in a given POSIX environment.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2016/18)/Issue7+TC2 0001440]: Calling `system("-some-tool")` fails (although it is a valid `sh` command)

2021-11-01 Thread Eric Blake via austin-group-l at The Open Group
On Sat, Oct 30, 2021 at 08:21:55PM -0400, Wayne Pollock via austin-group-l at 
The Open Group wrote:
> Is it guaranteed that on conforming systems nohup (and friends) must not 
> accept or
> delete the first "--"?  For the example to work, nohup must not discard the 
> "--".
> But might it?

I'm not sure why you claim nohup would not work if it discards "--".

Just because the standard does not require nohup to accept options
does not mean that implementations cannot have options as an
extension.

> 
> Section 1.4 "Utility Description Defaults" of the Introduction states
> "... Default Behavior: When this section is listed as "None.", it means that 
> the
> implementation need not support any options. Standard utilities that do not 
> accept
> options, but that do accept operands, shall recognize "--" as a first 
> argument to be
> discarded. ..."
> 
> And nohup fits that description; its OPTIONS section is listed as "None".

Correct, and that text does not need changing.  As you correctly
quoted, that means that nohup MUST accept and discard an initial "--",
the same as basename (another utility where I have seen the common bug
of handling -- incorrectly in some implementations).  If you want to
invoke another app that may begin with "-", or if you want to ensure
that a later "--" is passed to the utility itself regardless of
whether nohup has the (non-standard) extension of reordering options
after arguments, you can always write:

nohup -- $utility -- $non_option

And a quick test demonstrates that at least GNU Coreutils' nohup is
compliant (it supports long options, which are already an extension to
the standard, but not short options; but it does honor -- for
attempting to execute $utility that may begin with -):

$ POSIXLY_CORRECT=1 nohup -- printf -- abc 2>/dev/null | cat
abc
$ POSIXLY_CORRECT=1 nohup printf -- abc 2>/dev/null | cat
abc
$ nohup --version | head -n1
nohup (GNU coreutils) 8.32
$ nohup -- --version
nohup: ignoring input and appending output to 'nohup.out'
nohup: failed to run command '--version': No such file or directory
$ rm nohup.out
$ 

> Maybe nohup needs to be among the utilities that do not recognize "--".

No. While we are explicit that echo is one of the few apps needing an
exception to not recognize "--", that exception does NOT need to apply
to nohup.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: Interpretation starting for a 30 day review (1440)

2021-10-29 Thread Eric Blake via austin-group-l at The Open Group
On Sat, Oct 30, 2021 at 12:46:55AM +0700, Robert Elz via austin-group-l at The 
Open Group wrote:
> Date:Fri, 29 Oct 2021 10:00:04 -0700
> From:Nick Stoughton 
> Message-ID:  
> 
> 
>   | Just for reference, the C standard says:
> 
> Thanks, it was a little hard to imagine just how they would be
> able to (with a straight face) talk about args to "sh" ...
> 
>   | So I agree, we should change the wording here so that for Issue 7 we only
>   | state what implementations should expect to do when Issue 8 comes out, and
>   | give application developers strong warnings about how to work around the
>   | issues caused by the possible (certain?) loss of the '--' in existing
>   | implementations.
> 
> If there was going to be a new Issue 7 rev, before Issue 8, that would
> perhaps be a plausible approach - but unless something has changed, and
> Issue8 is not to be the next version released, that doesn't really work.

Another thing to consider: if enough implementations fix things NOW to
use "--" in system() and popen(), then by the time we actually DO
release Issue 8, it will already be common enough practice to
standardize it.  But I also agree with your argument that at a bare
minimum, we owe the reader some Rationale text explaining that older
versions of the standard did not require sane behavior for arguments
starting with '-' or '+', and that applications can always space-stuff
their commands to ensure desired behavior regardless of whether the
underlying implementation has Issue7 or Issue8 semantics (if we go
ahead and require "--" in Issue8).
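
By "space-stuff" I mean something along these lines, as a sketch only
(the wrapper name system_stuffed is made up):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Hypothetical wrapper: prepend one space so the command string handed
 * to sh -c can never start with '-' or '+', whether or not the
 * underlying system() passes "--" to the shell. */
static int system_stuffed(const char *cmd)
{
    size_t n;
    char *buf;
    int rc;

    assert(cmd != NULL);
    n = strlen(cmd);
    buf = malloc(n + 2);          /* leading space + command + NUL */
    if (buf == NULL)
        return -1;
    buf[0] = ' ';
    memcpy(buf + 1, cmd, n + 1);  /* copies the terminating NUL too */
    rc = system(buf);
    free(buf);
    return rc;
}
```

The leading space is harmless to the shell's parsing of the command,
so the wrapper behaves the same under Issue7 and Issue8 semantics.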

At any rate, I've now filed a glibc bug, so we'll see what other libc
authors think about both the POSIX bug and your reaction about it
being premature to standardize a requirement of "--" (vs. just merely
recommending it and documenting what portable apps must do in the
meantime).

https://sourceware.org/bugzilla/show_bug.cgi?id=28519

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Question regarding gettext behavior on iconv failure

2021-05-03 Thread Eric Blake via austin-group-l at The Open Group
Hello GNU gettext maintainers,

In today's Austin Group meeting, we developed an example of using the
proposed POSIX standardization of gettext() and encountered a situation
where we felt that GNU gettext may have a bug.  For context, the entire
example is at:
https://posix.rhansen.org/p/gettext_split

The example in question set up several .po files and a specific
environment to test various pluralization/transcoding fallbacks, and
concludes with a snippet where a string with an encoding error in
ISO-8859-1 is output in spite of an iconv failure, rather than the
string passed in to ngettext():


n_recipients = 1;
// The following outputs "1 Empfänger" encoded in UTF-8:
printf("%s\n", ngettext("recipient", "recipients", n_recipients));

bind_textdomain_codeset("mail", "ASCII");

n_recipients = 1;
// The following outputs "recipient" with the same encoding as the
// "recipient" argument to ngettext (remember, the system is assumed
// to not support conversion from ISO/IEC 8859-1 to ASCII):
printf("%s\n", ngettext("recipient", "recipients", n_recipients));
// On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e.
// no conversion is done). I think we already agreed on considering this
// behavior a bug,

This raises a few questions: does the GNU gettext team agree that this
can be considered a bug, and if so, will a future gettext release behave
differently?  Or if it is intentional and not a bug, can you provide
justification for the behavior as well as tweaks to the proposed
standard wording for gettext requirements and the worked example?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
On 3/9/21 1:34 PM, Eric Blake via austin-group-l at The Open Group wrote:
> On 3/9/21 10:14 AM, shwaresyst wrote:
>>
>> To me that looks like a conformance violation and should be reverted. There 
>> is no _SC_SIGSTKSZ defined in <unistd.h> by the standard, to begin with, so 
>> that use of sysconf() is a non-portable extension on its own.
> 
> Portable apps can't use _SC_SIGSTKSZ, but the standard generally permits
> implementations to define further constants.  Then again, re-reading XSH
> 2.2.2:
> 
> " Implementations may add symbols to the headers shown in the following
> table, provided the identifiers for those symbols either:
> 
> Begin with the corresponding reserved prefixes in the table, or
> ..."
> 
> but the table lacks a row for <unistd.h> with _CS_* and _SC_* constants.
>  Looks like you found an independent defect.

Not quite, because later it states "The following identifiers are
reserved regardless of the inclusion of headers: 1. With the exception
of identifiers beginning with the prefix _POSIX_, all identifiers that
begin with an <underscore> and either an uppercase letter or another
<underscore> are always reserved for any use by the implementation.", so
an implementation can blindly add _SC_* constants at will without
violating the standard.

Still, I opened:
https://www.austingroupbugs.net/view.php?id=1456
to try and add some clarification.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
On 3/9/21 10:14 AM, shwaresyst wrote:
> 
> To me that looks like a conformance violation and should be reverted. There 
> is no _SC_SIGSTKSZ defined in <unistd.h> by the standard, to begin with, so
> that use of sysconf() is a non-portable extension on its own.

Portable apps can't use _SC_SIGSTKSZ, but the standard generally permits
implementations to define further constants.  Then again, re-reading XSH
2.2.2:

" Implementations may add symbols to the headers shown in the following
table, provided the identifiers for those symbols either:

Begin with the corresponding reserved prefixes in the table, or
..."

but the table lacks a row for <unistd.h> with _CS_* and _SC_* constants.
 Looks like you found an independent defect.

> 
> I could see the definition of SIGSTKSZ being changed to the static minimum a 
> particular processor requires, or is initially allocated as a 'safe' amount, 
> rather than static "default size", and moving SIGSTKSZ to . This 
> would contrast to MINSIGSTKSZ as the lowest value for a platform for all 
> supported processors. Then an application could use sysconf() to query for 
> the maximum size the configuration supports if it wants to use more than 
> that, as a runtime increasable limit.

As I understand it, the concern in glibc is less about runtime
increasability, so much as ABI compatibility with applications compiled
against older headers at a time when the kernel had less state
information to store during a context switch.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
On 3/9/21 9:26 AM, Andreas Schwab wrote:
> On Mär 09 2021, Eric Blake via Libc-alpha wrote:
> 
>> The question becomes whether glibc is in violation of POSIX for having
>> made the change, or whether POSIX needs to be amended to allow SIGSTKSZ
>> to be non-preprocessor-safe and/or non-constant.
> 
> POSIX already allows non-preprocessor-safe.

True, but expanding 'SIGSTKSZ' to 'sysconf (_SC_SIGSTKSZ)' does not yield a
symbolic constant, as it is not "a compile-time constant expression
with an integer type", per definition 3.380.

Looks like this discussion is happening in parallel in:
https://sourceware.org/bugzilla/show_bug.cgi?id=20305

I can open a defect against POSIX if we decide that is needed, but want
some consensus first on whether it is glibc's change that went too far,
or POSIX's requirements that are too restrictive for what glibc wants to do.
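
To make the breakage concrete, here is a rough sketch of the kind of code
that stops compiling, next to a variant that only needs SIGSTKSZ to be a
run-time expression.  The function name install_altstack() is
hypothetical, and the commented-out lines are my guess at the typical
failing patterns, not a quote from any particular project.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

/* Patterns that break once SIGSTKSZ expands to a function call:
 *
 *     static char stack_buf[SIGSTKSZ];   -- needs a constant expression
 *     #if SIGSTKSZ < 16384               -- needs a preprocessor constant
 *
 * Sizing the stack at run time avoids both assumptions. */
int install_altstack(void)
{
    size_t size = (size_t)SIGSTKSZ;    /* fine as a run-time expression */
    stack_t ss;

    assert(size >= (size_t)MINSIGSTKSZ);
    ss.ss_sp = malloc(size);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_size = size;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;   /* the buffer stays allocated while the stack is in use */
}
```

The cost is a heap allocation at startup, which is awkward for code that
wants its stack-overflow handler to work without relying on malloc().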

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
[adding glibc and Austin group lists]

On 3/6/21 12:50 PM, Bruno Haible wrote:
> Hi,
> 
> Carol Bouchard wrote in 
> <https://lists.gnu.org/archive/html/bug-m4/2021-03/msg0.html>:
>> A change that was introduced is the
>> #define SIGSTKSZ is no longer a statically defined variable.  Its value can
>> only be determined at run time.
>>
>> # define SIGSTKSZ sysconf (_SC_SIGSTKSZ)
> 
> This is invalid. POSIX:2018 [1] defines two lists of macros:
> 
>   1) "The <signal.h> header shall define the following macros which shall
>   expand to integer constant expressions that need not be usable in
>   #if preprocessing directives:"
> 
>   2) "The <signal.h> header shall also define the following symbolic
>   constants:"
> 
> SIGSTKSZ is in the second list. This implies that it must expand to a constant
> and that it must be usable in #if preprocessing directives.

The question becomes whether glibc is in violation of POSIX for having
made the change, or whether POSIX needs to be amended to allow SIGSTKSZ
to be non-preprocessor-safe and/or non-constant.

> 
> Besides being invalid, it is also not needed. The alternate signal stack
> needs to be dimensioned according to the CPU and ABI that is in use. For
> example, SPARC processors tend to use much more stack space than x86 per
> function invocation. Similarly, 64-bit execution on a bi-arch CPU tends to
> use more stack space than 32-bit execution, because return addresses and
> other pointers are 64-bit vs. 32-bit large. But once you have fixed the CPU
> and the ABI, there is no ambiguity any more.
> 
>> This affects m4 code since the code assumes a statically defined variable
>> which can be determined at preprocessor time.
> 
> POSIX guarantees this assumption.
> 
>> Please advise how I can get past this.
> 
> Fix your <signal.h>.

https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=6c57d320484988e87e446e2e60ce42816bf51d53
shows where glibc made the change, and I've now seen reports of several
projects failing to build when using glibc with this change included.

> 
> Bruno
> 
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2016)/Issue7+TC2 0001345]: date(1) default format

2020-07-13 Thread Eric Blake

On 7/13/20 4:07 AM, Geoff Clare wrote:

J William Piggott  wrote, on 12 Jul 2020:


On Mon, 6 Jul 2020, Geoff Clare wrote:


There is no way we are going to change the required d_t_fmt value for
the POSIX locale.


Why?


Because every implementation would have to change, and all applications
that rely on the current value would potentially break (depending on
which fields they use).

We would need a very good reason to make such a change.


Has it been discussed with 'we'? Would any of them like to comment on
this please?


I've been using "we" to refer to the whole Austin Group, i.e. everyone
on this mailing list, so yes it has been discussed (in this thread).

The three people whose opinions matter are the organisational
representatives who would vote on it if it came to that.  I'm sure if
any of them disagree with what I've said they will comment.


To make matters clearer, as one of the organisational reps (Open Group), 
I'm of the mind that mandating a change to existing practice is 
undesirable.  Standardizing a new format value (especially if there is 
existing practice to copy from) or adding better documentation to make 
it clear about the intentional differences between strftime vs. date are 
both less invasive than a mandatory change to the contents of an 
existing format value.  So I'm concurring with Geoff's handling of the 
responses so far.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2008)/Issue 7 0000411]: adding atomic FD_CLOEXEC support

2020-03-12 Thread Eric Blake

On 3/12/20 1:02 PM, shwaresyst wrote:


Fyi,  the Last updated: date at top wasn't changed.
On Thursday, March 12, 2020 Austin Group Bug Tracker wrote:

A NOTE has been added to this issue.



--
  (0004796) eblake (manager) - 2020-03-12 16:35
  https://www.austingroupbugs.net/view.php?id=411#c4796
--
minor tweak to the attached files to fix an instance of O_CLOEXEC that
should be SOCK_CLOEXEC in relation to accept4().


Thanks. I'll re-upload with that additional date tweak.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [1003.1(2008)/Issue 7 0000252]: dot should follow Utility Syntax Guidelines

2020-02-04 Thread Eric Blake

On 2/4/20 9:29 AM, Eric Blake wrote:

On 2/4/20 9:16 AM, Robert Elz wrote:

I am putting this in a new thread, as it isn't really important,
more just amusing, but the solution to this issue, with respect
to the "." command, is I think, causing that command to be in
violation of the standard (in a completely different way than what
the previous discussion is about).

The resolution of the issue (as was previously noted) adds the words:

 The dot special built-in shall support XBD Section 12.2 (on page 215).





Last time I looked, "." was neither a lower case letter, nor a digit, in
any character set.

Hence, the resolution of this issue has caused a contradiction in the
standard - those guidelines are both required, and ignored, all in the
same command.

We could fix all this by changing the name of the "." command, probably to
"source" as that's already supported by some shells, but is this degree of
pedantry really important, or do we just live with the standard being
inconsistent with itself?


Good catch.  However, I don't think we can require the name 'source'; 
better would be a fix along the lines of what we do for 'tail', in 
documenting explicit exceptions to the XBD guidelines.  Something like:


The dot special built-in shall support XBD Section 12.2 (on page 215), 
except that it does not comply with Guideline 1 or 2 due to its name.


or maybe

The dot special built-in shall support Guidelines 3-14 of XBD Section 
12.2 (on page 215).




For what it's worth, the standard was already self-contradictory for the 
[ utility; it is also required to support XBD Section 12.2 with an 
exception for Guideline 10; but would need a similar exemption for 
Guideline 1 and 2.  I don't think reopening bug 252 is correct, but a 
new bug fixing both '.' and '[' would be in order.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [1003.1(2008)/Issue 7 0000252]: dot should follow Utility Syntax Guidelines

2020-02-04 Thread Eric Blake

On 2/4/20 9:16 AM, Robert Elz wrote:

I am putting this in a new thread, as it isn't really important,
more just amusing, but the solution to this issue, with respect
to the "." command, is I think, causing that command to be in
violation of the standard (in a completely different way than what
the previous discussion is about).

The resolution of the issue (as was previously noted) adds the words:

 The dot special built-in shall support XBD Section 12.2 (on page 215).




Last time I looked, "." was neither a lower case letter, nor a digit, in
any character set.

Hence, the resolution of this issue has caused a contradiction in the
standard - those guidelines are both required, and ignored, all in the
same command.

We could fix all this by changing the name of the "." command, probably to
"source" as that's already supported by some shells, but is this degree of
pedantry really important, or do we just live with the standard being
inconsistent with itself?


Good catch.  However, I don't think we can require the name 'source'; 
better would be a fix along the lines of what we do for 'tail', in 
documenting explicit exceptions to the XBD guidelines.  Something like:


The dot special built-in shall support XBD Section 12.2 (on page 215), 
except that it does not comply with Guideline 1 or 2 due to its name.


or maybe

The dot special built-in shall support Guidelines 3-14 of XBD Section 
12.2 (on page 215).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2008)/Issue 7 0000252]: dot should follow Utility Syntax Guidelines

2020-02-04 Thread Eric Blake
shell (and dash, bosh, and pdksh - so maybe ksh88) it works just fine.

Keeping that complicates all of this - otherwise I would simply implement
"if there is exactly one arg it must be intended to be the required file
name, whatever it looks like, otherwise..." and things would be a little
simpler.

kre

ps: if it turns out I know someone in the balloting group for final approval
of this, I'd suggest to them that they do not approve the new version while
it contains the kind of incompatibility and breakage for no particularly
good reason that seems to me to exist.


But I think you've overlooked the fact that you ARE allowed to have the 
extension behavior for all except '--' that preserves your goal of "if 
there is one argument, it is a filename even if it starts with '-'", 
while still remaining compliant to the standard's "'. -- "$arg"' must 
treat $arg as a filename even if it starts with '-'".


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: $/ as a textual exit status (Was [1003.1(2016)/Issue7+TC2 0001321]: exit status for false should be 1-125)

2020-01-31 Thread Eric Blake

On 1/31/20 10:15 AM, Joerg Schilling wrote:

Robert Elz  wrote:


 Date:Fri, 31 Jan 2020 11:43:17 +0100
 From:Joerg Schilling 
 Message-ID:  
<5e3404c5.tqmsutrzovb6+pjf%joerg.schill...@fokus.fraunhofer.de>

   | The real problem I see is that more than 30 years after waitid() has been
   | introduced to be able to return all 32 bits of the exit() call parameter,
   | bosh is still the only shell that does no longer live in the 1970s
   | with respect to exit() code handling.

To me, this tells everything.   If a function is needed, and a solution
is provided, it gets used (might take a little while, but it happens).
When a solution is provided to a problem that doesn't really exist, it
tends to simply be ignored.


This looks like a misinterpretation.

The main issue seems to be that most kernel implementations implemented
waitid() in a useless way.

This changed approx. 4 years ago, when FreeBSD fixed their waitid()...


The Linux kernel still only tracks 8 bits of information, truncating 
during _exit().  Although I have in the past raised the issue to Linux 
kernel developers that Linux' waitid() is non-compliant, no one has yet 
submitted a Linux kernel patch to update the process struct to track 32 
bits and fix _exit/waitid to expose them.
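
For anyone who wants to check what a given kernel does, here is a small
sketch.  The helper name exit_status_via_waitid() is mine; it forks a
child that calls _exit() with the given value and reports what the parent
sees in si_status.  On an implementation that tracks the full int, the
value should come back intact; where only 8 bits are kept, the result is
the low byte.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that calls _exit(code) and return the
 * status the parent observes through waitid(), or -1 on error. */
int exit_status_via_waitid(int code)
{
    siginfo_t info;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                   /* child: no stdio flushing */

    if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0)
        return -1;
    assert(info.si_code == CLD_EXITED);
    return info.si_status;             /* full value or low 8 bits */
}
```

Comparing exit_status_via_waitid(0x1234) against 0x1234 and against 0x34
tells you which camp the system is in.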


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2013)/Issue7+TC1 0001045]: Issues with "cd -"

2019-10-23 Thread Eric Blake

On 10/23/19 9:49 AM, Geoff Clare wrote:

Hi Konrad.

The status changing to APPLIED means that the edits have now been made
to the troff source of the standard.

Although that doesn't mean they are set-in-stone, we would need a good
reason to reopen the bug to change the resolution, and then update
the troff to reflect the new resolution.

As regards removing the "" case from the example, the parenthetical
note after the code explains why that is there.


The proposal was not to delete the "" case, but...



case $dir in
 (/*) CDPATH= cd -P "$dir";;
 ("") CDPATH= cd -P "";;
 (*) CDPATH= cd -P "./$dir";;
 esac

be shortened to

case $dir in
 (/*|) CDPATH= cd -P "$dir";;


to condense two cases into one.  Except that it uses the wrong syntax; 
the correct spelling would be (/*|'') (or using more spacing, '( /* | '' 
)').


I don't have any qualms with condensing the example from a technical 
standpoint (if done correctly), but question whether it counts as a mere 
editorial change worth making this late in the process for this bug.



 (*) CDPATH= cd -P "./$dir";;
 esac

?

Also, from a usability perspective, I think it would be better if `-' lost its 
special meaning after `--'.  This would make the above code superfluous.


Coding that up would be at odds with existing practice, so even if we 
were to choose that way if designing from scratch, I don't think we can 
make that change now.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: Draft minutes of the 5th August 2019 Teleconference

2019-08-07 Thread Eric Blake
On 8/7/19 4:43 PM, enh wrote:
> What's the plan for the qsort_r interface, given that glibc and BSD have
> mutually incompatible ones (which is why I didn't add it to Android)?

Per http://austingroupbugs.net/view.php?id=900#c4112, FreeBSD was
planning to switch over to the glibc signature, making it easier to
standardize things as 'qsort_r' as presented in the bug, rather than as
'posix_qsort_r'.  But as there is still a 30-day window for Open Group
objections, we may very well receive an objection to the name 'qsort_r'
where we would have to go with 'posix_qsort_r'.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: Draft minutes of the 5th August 2019 Teleconference

2019-08-07 Thread Eric Blake
On 8/6/19 4:48 AM, Geoff Clare wrote:
> These are the draft minutes from yesterday's call.  Andrew will need
> to allocate the Austin-xxx document number and add the file to the
> document register after he returns.
> 

> 
> Minutes of the 5th August 2019 Teleconference Austin-xxx Page 1 of 1
> Submitted by Geoff Clare, The Open Group. 6th August 2019

Followup:


> Bug 1220: Add an API to query the name of a locale category of a locale 
> object OPEN
> http://austingroupbugs.net/view.php?id=1220
> 
> Action: Eric to ask if The Open Group is willing to sponsor this interface.
> 
...
> 
> Bug 1263: Add ppoll()  OPEN   
> http://austingroupbugs.net/view.php?id=1263
> 
> Action: Eric to ask if The Open Group is willing to sponsor this interface.

Now complete, along with earlier actions to ask about sponsorship of
qsort_r in bug 900 and reallocarray in bug 1218.  I proposed a 30 day
window for any comments or objections, and will follow up in early
September (with the assumption that no objections is tacit approval that
we proceed with the new interfaces).


> Bug 374: malloc(0) and realloc(p,0) must not change errno on success   OPEN
> http://austingroupbugs.net/view.php?id=374
> 
> Geoff had noticed an overlap between changes suggested in this open bug
> and the changes needed to align with C17.
> 
> We also noted that glibc does not conform to the change we made in
> 2008-TC1 to require that errno is set to an implementation-defined
> value if realloc(p,0) returns null.  This matches the change made in
> C17 7.22.3.1 (overview) which says that if a null pointer is returned in
> the size 0 case it is "to indicate an error".  However, 7.22.3.5 (realloc)
> still says "If size is zero and memory for the new object is not
> allocated, it is implementation-defined whether the old object is
> deallocated" and "The realloc function returns a pointer to the new
> object [...], or a null pointer if the new object has not been allocated"
> which seems to imply a null pointer can be returned in this case without
> it being considered an error.
> 
> Action: Eric to ask about this on the glibc mailing list.

Also done; Florian Weimer has replied to the bug in note 4510, and in
fact,...

> 
> Action: Nick to draft a Clarification Request to WG14.

...says he already raised a similar question to WG14 in May 2018
(although I do not have a URL handy to that thread).

In fact, the call to standardize reallocarray() may also want to depend
on the outcome here.  http://austingroupbugs.net/view.php?id=1218
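
Until that is settled, applications can sidestep the ambiguity by never
passing a zero size to realloc().  A minimal sketch follows; the wrapper
name resize_buffer() and its return convention are assumptions of mine,
not anything proposed in the bug.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper: resize *pp to size bytes without ever calling
 * realloc(p, 0).  Returns 0 on success, -1 on allocation failure (in
 * which case *pp is left untouched and still valid). */
int resize_buffer(void **pp, size_t size)
{
    void *q;

    assert(pp != NULL);
    if (size == 0) {
        free(*pp);         /* explicit release, no implementation-defined case */
        *pp = NULL;
        return 0;
    }
    q = realloc(*pp, size);
    if (q == NULL)
        return -1;         /* old block has not been freed */
    *pp = q;
    return 0;
}
```

With this shape, a null pointer in *pp after a successful call always
means "no buffer", and failure is reported separately from the pointer
value, so errno never has to be consulted to tell the two apart.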

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: x[ as first word in sh

2019-07-29 Thread Eric Blake
On 7/29/19 1:08 PM, Stephane Chazelas wrote:
> That's a follow-up on
> https://www.mail-archive.com/bug-bash@gnu.org/msg23451.html
> 
> Is there anything in the POSIX spec that allows:
> 
> x[ foo
> 
> To be interpreted as anything other than invoking the "x["
> command with "foo" as argument?
> 
> I had the vague recollection that there was but I can't find it
> now. If there's not, that would be a bug in the spec as several
> shells including ksh, bash, zsh, yash treat it as the start of
> an array element assignment (like in:
> 
> x[ foo
> + 1]=value

Ouch. I think you've identified a real problem.

In XCU 2.13.3, we explicitly added wording to allow unmatched unquoted
'[' in a word to be used in its role similar to 'test' (the difference
being whether a later ']' argument is necessary):

If the pattern contains an open bracket ( '[' ) that does not introduce
a bracket expression as in XBD RE Bracket Expression, ...
If the pattern does not match any existing filenames or pathnames, the
pattern string shall be left unchanged.

So by that argument, if the shell parses 'x[' as a word, then because it
does not form a valid glob, it must be used unchanged as the command
name.  But that explicit wording does not cover whether 'x[' has to be
delimited as a word.
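
The pattern-matching half of this can be probed from C, since fnmatch()
is specified in terms of the same pattern rules.  The sketch below uses a
made-up wrapper name, pattern_matches(), and of course says nothing about
how a shell delimits words; it only shows what the local libc does with
a bracket.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fnmatch.h>

/* Hypothetical wrapper: return 1 if string matches pattern under the
 * pattern-matching rules implemented by fnmatch(), else 0. */
int pattern_matches(const char *pattern, const char *string)
{
    int rc = fnmatch(pattern, string, 0);

    assert(rc == 0 || rc == FNM_NOMATCH);   /* no other result expected here */
    return rc == 0;
}
```

A well-formed bracket expression behaves as usual, so
pattern_matches("x[ab]", "xa") matches while pattern_matches("x[ab]", "x[")
does not.  Calling pattern_matches("x[", "x[") then shows whether the
implementation treats the unmatched '[' as a literal character, which is
what the wording quoted above asks for.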

XCU 2.3 states in rule 7 that an unquoted blank ends the delimiting of
any prior word, but the shells you are showing that parse a[b]= as an
array assignment are trying to find the matching ] before delimiting the
first word (so the shell extension of array assignment is somehow acting
as a quoting context that prevents the whitespace thus parsed from being
the unquoted blank that ends the delimiting of the word).

The shell grammar, at XCU 2.10.1, allows for array assignments in rule 7b:

If the TOKEN contains an unquoted (as determined while applying rule 4
from Token Recognition) <equals-sign> character that is not part of an
embedded parameter expansion, command substitution, or arithmetic
expansion construct (as determined while applying rule 5 from Token
Recognition):

If the TOKEN begins with '=', then rule 1 shall be applied.

If all the characters in the TOKEN preceding the first such
<equals-sign> form a valid name (see XBD Name), the token
ASSIGNMENT_WORD shall be returned.

Otherwise, it is unspecified whether rule 1 is applied or
ASSIGNMENT_WORD is returned.

with the intent that a[b]= can be an ASSIGNMENT_WORD in shells with
array extensions, but can also be WORD for shells that treat it as glob
to determine a command name.  But without any explicit specification of
permitting whitespace in the array arguments, it looks like there is a
discrepancy between POSIX requirements and existing shell behavior.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [1003.1(2016)/Issue7+TC2 0001234]: in most shells, backslash doesn't have two meaning wrt pattern matching

2019-06-21 Thread Eric Blake
On 6/21/19 4:00 PM, Stephane Chazelas wrote:

>> The fact that bash 5's
>> behavior breaks as_echo in the presence of certain filenames is
>> definitely a discouraging regression; but I haven't paid enough
>> attention to the details of this thread to know if it was broken only in
>> the initial bash 5 release and since fixed in a followup patch, or if it
>> is still broken with all of Chet's current official patches applied on top.
> [...]
> 
> Chet has clarified that it was intentional and to match Geoff's
> interpretation of the standard. Chet has just mentioned he's
> added a new posixglob option (on by default) to the devel branch
> today
> (http://git.savannah.gnu.org/cgit/bash.git/commit/?h=devel&id=48492ffae22d692594757e53fb4580ebb1f506cf)
> which when disabled reverts to the old behaviour.

The sad part will be if the behavior controlled by 'set -o posixglob' or
'shopt -s posixglob' (I haven't yet checked which of the two means Chet
added it under) will actually be setting behavior NOT specified by
POSIX, depending on how this current thread plays out.  And even if he
leaves in a knob, I hope that the default for that knob when bash is
invoked as /bin/sh is historical behavior (what bash decides to default
to in bash mode is a different matter). But as long as it remains on
Chet's development branch and not a 5.1 release or an official patch to
5.0, there's still time for Chet to change it...

> 
> To quote two striking examples that have already been given,
> that interpretation of the standard would mean that:
> 
> pattern='\.'
> grep $pattern file
> 
> Which in all shells is documented to search for lines that
> contain a dot in "file" would now be required to instead search
> for lines that contain at least one character in "file", as \ is
> now a glob quoting operator, and \. happens to match the .
> directory entry (on those systems where . is included in the
> result of readdir() at least and with shells that don't skip .
> and .. in glob expansions).

Where's the glob character that causes $pattern to be subjected to
globbing?  Had there also been a '*', '[', or '?' in $pattern, I could
(sort of) see the logic to the unquoted $pattern being subjected to use
as a glob pattern. But when there are no globbing characters at all, why
does \. suddenly serve to cause a glob lookup (where \ is then erased by
the globbing procedure) and match '.' in the current directory?  (And
yes, this one is also confusing because of the ongoing work on the other
open bug about whether shells should be permitted to always omit '.'
from globbing, regardless of whether readdir() omitted it)
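
The "backslash is erased by the matching procedure" effect is easy to
reproduce outside the shell with fnmatch(), which is roughly what a shell
ends up doing if it treats the word as a pattern at all.  In this sketch
the function name selects() and its third parameter are invented for
illustration; FNM_NOESCAPE is the standard flag that turns backslash
back into an ordinary character.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fnmatch.h>

/* Hypothetical helper: return 1 if name is matched by pattern.  When
 * backslash_quotes is nonzero, backslash quotes the next character (the
 * fnmatch() default); otherwise FNM_NOESCAPE makes it an ordinary one. */
int selects(const char *pattern, const char *name, int backslash_quotes)
{
    int flags = backslash_quotes ? 0 : FNM_NOESCAPE;
    int rc = fnmatch(pattern, name, flags);

    assert(rc == 0 || rc == FNM_NOMATCH);
    return rc == 0;
}
```

With quoting in effect, the two-character pattern \. matches the
directory entry "." and the pattern %s\n matches a file named "%sn";
with FNM_NOESCAPE, \. no longer matches ".".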

> 
> and
> 
> touch %sn
> cmd='printf %s\n'
> $cmd test
> 
> which in all shells is documented to output test would
> now be required to output testn (without newline).
> 
> That's what bash5 now implements.

And that is indeed the regression in behavior, which seems to not be the
historical practice of any earlier shell. If the standard has to permit
bash 5 behavior by leaving it unspecified, we've still rendered a lot of
existing scripts broken; better would be if we can agree on standard
wording (and Chet updates bash 5 to match) to do what has traditionally
been done of NOT globbing a sequence that does not contain '*', '[' or '?'.
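
As a strawman for what "traditional" means here, the check could be as
simple as the sketch below.  The name contains_glob_char() is mine, and
it deliberately ignores quoting, since by the time a shell reaches
pathname expansion it already knows which characters were quoted; this
only models the result of an unquoted expansion.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical predicate: return 1 if word contains '*', '?' or '[',
 * i.e. if the traditional behavior would attempt pathname expansion on
 * it at all; return 0 if the word would be left alone. */
int contains_glob_char(const char *word)
{
    assert(word != NULL);
    return strpbrk(word, "*?[") != NULL;
}
```

Under this rule, both \. and %s\n are passed through untouched, so the
contents of the current directory cannot change what grep or printf
receives.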

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [1003.1(2016)/Issue7+TC2 0001234]: in most shells, backslash doesn't have two meaning wrt pattern matching

2019-06-21 Thread Eric Blake
On 6/21/19 2:47 PM, Stephane Chazelas wrote:
> In http://austingroupbugs.net/bug_view_page.php?bug_id=1222
> I asked that POSIX *allows*, not even mandate an interface
> supported by one sh implementation and documented as such for
> over 25 years (since before the first version of the POSIX.2
> specification) that addresses that: echo -E - "$var"
> 
> That's not such a useless feature. Without it, echo can't be
> used to output arbitrary data, which is exactly what that
> autoconf as_echo is trying to do.

POSIX has already long-documented the fact that echo cannot be used to
output arbitrary data, and recommended the use of printf. autoconf's
as_echo should be viewed as a thin shim around printf these days, rather
than an attempt to portably use echo (the name as_echo hearkens back to
the days when echo was a shell builtin everywhere but printf was not, so
using echo was preferable to forking when it was possible; but while
things have evolved since then, the name stuck).  The fact that bash 5's
behavior breaks as_echo in the presence of certain filenames is
definitely a discouraging regression; but I haven't paid enough
attention to the details of this thread to know if it was broken only in
the initial bash 5 release and since fixed in a followup patch, or if it
is still broken with all of Chet's current official patches applied on top.

> 
> It was rejected (even the "just allowing it") on the ground that
> it would break existing scripts (without providing any evidence;
> no need to look now, I know it does break some).

Part of the reason for that rejection (since I remember being on that
call) was that the only example provided of 'echo -E -' not outputting
the '-' was for zsh in non-POSIX mode - but zsh is already notoriously
and intentionally non-POSIX when not in POSIX mode.  The assumption made
during the teleconference is that zsh in POSIX mode could just as easily
comply with what all other shells do in strict compliance mode of
outputting a literal '-', if zsh still wants to try for POSIX compliance
(and even that fact is less obvious, as we have not had as many comments
from zsh developers as we have had from other shells that are at least
trying to come to common grounds via POSIX).

> 
> That would break scripts that pass "-" as the *first* argument of
> *one* command (echo) and that happen to be interpreted by a shell
> that has implemented that allowed, but not required feature.
> 
> My point was so that POSIX warn people against expecting "echo
> -" to output "-" as it does not in all shells in practice.

Perhaps that point could still be made as a non-normative point in the
application usage section of echo, but if the only shell affected by the
problem is zsh in non-POSIX mode, it felt like a bit much to be added at
the time.

> 
> Instead, now, POSIX want to *mandate* not only allow a feature
> that not a single shell has done, that is not needed at all, and
> that would potentially break all scripts that pass an unquoted
> word expansion containing a backslash in *any* position, in
> *any* argument to *any* command.
> 
> Isn't there some level of double standard there?

You're reading far too much into the outcome of the current discussion.
I'm not yet convinced that POSIX is trying to mandate behavior at odds
with existing shell practice, and the various mailing list threads on
the topic are far from over.  Various proposals may have added words
that can be construed in that manner, but that does not mean that POSIX
has adopted that proposal, nor that it will do so without first addressing
the problematic wording.  We intentionally did not reach a final
resolution on the backslash issue on yesterday's call because of the
continued activity on the mailing list.

And the fact that you have demonstrated several time-bombs where
existing shell scripts coupled with historical shell behaviors can
result in non-obvious changes in behavior based on the contents of the
current working directory make this an interesting problem.  But part of
the issue is coming up with acceptable wording that either permits
existing practice (at the risk of rendering common shell script examples
in the wild as tickling unspecified behaviors), or which tightens things
to be less unpredictable (even if it renders existing shells as
non-compliant).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: Resetting getopt's state

2019-01-07 Thread Eric Blake
On 12/28/18 4:50 AM, Joerg Schilling wrote:
> Simon Ser  wrote:
> 
>> Hi,
>>
>> There's currently no way to reset getopt's internal state. This means
>> you can't use getopt for two different argument vectors. Are there
>> plans to standardize a way to do so?

This topic also recently came up in the qemu mailing list, so it IS
something that the standard should consider addressing:

https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg00987.html

>>
>> The current standard says:
>>
>>> If the application sets optind to zero before calling getopt(), the
>>> behavior is unspecified.
>>
>> Many libcs allow optind to be set to zero to reset getopt. Some BSDs
>> have an additional optreset variable that can be set.

Both BSD and glibc have extensions that require tracking additional
hidden state, and thus also provide extensions for resetting that
additional hidden state (glibc with semantics that depend on reading
POSIXLY_CORRECT from the environment, and that differ based on whether
'-' or '+' was the first character in optstring; then both glibc and BSD
support an optstring of "a::" to allow optional arguments to -a). It's a
shame that glibc picked 'optind=0' and BSD picked 'optreset=1' as their
two hard-reset mechanisms.  The standard explicitly calls out 'optind=0'
as having unspecified behavior (which permits the glibc extension); the
BSD choice is a bit harder to work with (as 'optreset' is not a name
reserved for the POSIX namespace, so it shouldn't be visible to a
strictly conforming application).

There is also hidden state that MUST be tracked by getopt() in order to
properly handle merged short options (that is, if the user passes "-ab"
in argv[1], getopt() must leave optind=1 when returning 'a' even though
the second call will return 'b' instead of 'a').  That hidden state is
what Joerg mentions here:

> 
> This would be a nonstandard method that still does not address the needs for 
> getopt() as it does not allow to restore the previous state.
> 
> There however is a method in use since 30 years that is useful. It has been 
> introduced by AT&T to allow the Bourne Shell to use getopt() for builtin 
> commands. This method is based on an additional global integer named "_sp" 
> that is used as the index in a multi option string. Restoring the previous 
> state is needed to permit to call shell builtins in the getopts(1) parsing 
> loop.
> 
> The initial value for _sp is 1 and if the value is set to 1, this resets the 
> internal state of getopt(). Restoring the previous value allows to restore the
> previous state.

And it also means that even if optind == 2, setting optind = 1 might not
fully reset things if _sp is not currently 1.

But note that the hidden state tracked by _sp is implicitly cleared any
time getopt() returns -1 (because you are no longer processing later
merged options from the same optind value) - that is, on implementations
that have _sp, it is implicitly reset to 1 after getopt() reaches the
end of options.  Thus, the end effect should be the same
whether we expose _sp for application use (preferably with a name that
is reserved by the standard rather than risking conflicts with existing
names in portable user programs), or whether we document that it IS
portable in practice to perform a SOFT reset of getopt() state by
running getopt() until it returns -1 prior to assigning optind = 1,
while still leaving the door open for hard reset if you used extensions
beyond POSIX (leading '-', leading '+', changing POSIXLY_CORRECT in the
environment, or use of '::').  Thus, I'm leaning towards writing a
defect that does just the latter (documenting optind=1 after getopt()
returned -1 as a soft reset, and leaving it as implementation extensions
for providing a hard reset), without bothering to expose _sp to
applications (although _sp can remain as one of the implementation
extensions).
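
The soft reset described above can be sketched in C. This is only an illustration, not wording from any defect report: parse_flags() and demo_two_vectors() are hypothetical names, and the sketch assumes an implementation on which assigning optind = 1 after getopt() has returned -1 is enough to start a fresh parse, which is the behavior being proposed for documentation rather than something the standard currently guarantees.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Parse one argument vector against the option string "ab".
 * Returns a bitmask (bit 0: -a seen, bit 1: -b seen), or -1 if an
 * unknown option was seen.  The loop always runs until getopt()
 * returns -1, so any hidden per-argument state (such as _sp) is back
 * at its initial value before the next soft reset. */
static int parse_flags(int argc, char *argv[])
{
    int c, mask = 0, bad = 0;

    optind = 1;   /* soft reset: assumed valid after a prior -1 return */
    opterr = 0;   /* keep getopt() quiet for this demonstration */
    while ((c = getopt(argc, argv, "ab")) != -1) {
        if (c == 'a')
            mask |= 1;
        else if (c == 'b')
            mask |= 2;
        else
            bad = 1;   /* keep looping so the parse still ends with -1 */
    }
    return bad ? -1 : mask;
}

/* Parse two unrelated vectors back to back.  The first uses merged
 * short options ("-ab"), the case that needs hidden state. */
int demo_two_vectors(void)
{
    char *first[]  = { "prog", "-ab", NULL };
    char *second[] = { "prog", "-b", "file", NULL };
    int m1 = parse_flags(2, first);
    int m2 = parse_flags(3, second);

    return m1 * 10 + m2;
}
```

The sketch deliberately avoids the extensions that need a hard reset (leading '-' or '+' in optstring, '::', or changes to POSIXLY_CORRECT between parses).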

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [RFC/PATCH glibc 0/2] setting working dir in posix_spawn()

2018-09-10 Thread Eric Blake

On 9/9/18 3:34 PM, Florian Weimer wrote:

On 09/08/2018 12:54 AM, Eric Blake wrote:
Also, I've realized that we do NOT need 
posix_spawn_file_actions_addopenat(). The main benefit of openat() is 
that you can redirect relative file names according to an fd of your 
choice, without affecting global state. But during posix_spawn(), 
there are no other threads competing for global state (if you are 
doing a library implementation where the chdir() is done between 
fork() and exec()), so:


openat(mydir, "file", mode);

can be decomposed to:

posix_spawn_file_actions_addopen(, 5, ".", O_RDONLY|O_DIRECTORY, 0);
posix_spawn_file_actions_addfchdir(, mydir);
posix_spawn_file_actions_addopen(, 4, "file", mode, 0);
posix_spawn_file_actions_addfchdir(, 5);
posix_spawn_file_actions_addclose(, 5);


Is it possible to choose an appropriate value for the directory 
descriptor automatically?


Not that I know of. But it's tougher than it looks - my initial thought 
was "what about a magic negative number" that says to auto-allocate at 
the next free fd (the way AT_FDCWD is a magic number) - but since the 
allocation of fds is done at a later point (the posix_spawn() call) than 
the addition to file_actions (the posix_spawn_file_actions_addopen()), 
there is no way to predict WHAT that fd will actually resolve to, and 
thus no way to reuse that fd in posix_spawn_file_actions_addfchdir(), 
posix_spawn_file_actions_adddup2(), or 
posix_spawn_file_actions_addclose() as needed.  In other words, by the 
time you're using posix_spawn(), you're already stuck with having to 
micro-manage your fds - and if you want to avoid closing something 
important by accident, you practically have to do:


scratch_fd = open("/dev/null", O_RDONLY|O_CLOEXEC);
posix_spawn_file_actions_addopen(, scratch_fd, ...);
posix_spawn(, , );
close(scratch_fd);
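
A fuller version of that pattern might look like the sketch below. The names (spawn_with_scratch_fd, fa) are hypothetical, the choice of "." as the path to open and of /bin/sh as the program to run are assumptions made only so the example is complete, and error handling is kept to a minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Reserve a descriptor number in the parent so the file actions can
 * name it safely, spawn a child, and return the child's exit status
 * (or -1 on failure). */
int spawn_with_scratch_fd(void)
{
    posix_spawn_file_actions_t fa;
    char *argv[] = { "sh", "-c", "exit 7", NULL };
    pid_t pid;
    int status, rc, scratch_fd;

    /* Whatever number open() returns here is known not to collide
     * with any descriptor the parent cares about. */
    scratch_fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
    if (scratch_fd < 0)
        return -1;
    if (posix_spawn_file_actions_init(&fa) != 0) {
        close(scratch_fd);
        return -1;
    }

    /* In the child: reuse the reserved number for a temporary open,
     * then close it again before the new program starts. */
    rc = posix_spawn_file_actions_addopen(&fa, scratch_fd, ".",
                                          O_RDONLY | O_DIRECTORY, 0);
    if (rc == 0)
        rc = posix_spawn_file_actions_addclose(&fa, scratch_fd);
    if (rc == 0)
        rc = posix_spawn(&pid, "/bin/sh", &fa, NULL, argv, environ);

    posix_spawn_file_actions_destroy(&fa);
    close(scratch_fd);   /* the parent no longer needs the reservation */
    if (rc != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```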



What about support for AT_EMPTY_PATH, for upgrading an O_PATH 
descriptor?  I think this operation still needs openat.


O_PATH and AT_EMPTY_PATH are Linux/glibc extensions not in POSIX. So 
yes, they are worth thinking about in terms of what glibc should 
provide, but I'm not sure they are sufficient on their own to require 
POSIX to worry about posix_spawn_file_actions_addopenat(); rather, they 
might argue that glibc should add posix_spawn_file_actions_addopenat_np().


Or looking at it another way - I'm trying to stick to the initial 
philosophy documented in the posix_spawn() RATIONALE section (page 1457 
in the 2017 edition):



The requirements for posix_spawn( ) and posix_spawnp( ) are:
• They must be implementable without an MMU or unusual hardware.
• They must be compatible with existing POSIX standards.
Additional goals are:
• They should be efficiently implementable.
• They should be able to replace at least 50% of typical executions of fork( ).
• A system with posix_spawn( ) and posix_spawnp( ) and without fork( ) should 
be useful, at least for realtime applications.
• A system with fork( ) and the exec family should be able to implement 
posix_spawn( ) and posix_spawnp( ) as library routines.


Adding just posix_spawn_file_actions_addfchdir() is lighter-weight than 
adding posix_spawn_file_actions_addopenat(), 
posix_spawn_file_actions_fchdirat(), and others.  If you really have to 
deal with things like O_PATH or AT_EMPTY_PATH, then pre-open the fd in 
the parent and use posix_spawn_file_actions_adddup2(), rather than 
making file_actions more complicated.  And we're not trying to replace 
100% of fork/exec, but merely try to add a common-enough chdir paradigm 
to make it easier to replace the common 50%.


Also, note that http://austingroupbugs.net/view.php?id=411 is also 
somewhat relevant, which states:



At line 46976 [XSH posix_spawn_file_actions_adddup2], add a sentence:

If fildes and newfildes are equal, then the action shall ensure that
the FD_CLOEXEC flag of fildes is cleared (even though dup2( ) would
leave it unchanged).

After line 46999 [XSH posix_spawn_file_actions_adddup2], add the
following:


Although dup2( ) is required to do nothing when fildes and newfildes
are equal and fildes is an open descriptor, the use of
posix_spawn_file_actions_adddup2( ) is required to clear the
FD_CLOEXEC flag of fildes. This is because there is no counterpart of
posix_spawn_file_actions_fcntl( ) that could be used for clearing the
flag; it would also be possible to achieve this effect by using two
calls to posix_spawn_file_actions_adddup2( ) and a temporary fildes
value known to not conflict with any other file descriptors, coupled
with a posix_spawn_file_actions_addclose( ) to avoid leaking the
temporary, but this approach is complex, and risks EMFILE or ENFILE
failure that can be avoided with the in-place removal of FD_CLOEXEC.
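
To make the difference concrete, here is a small sketch of what a fork()-based library implementation could do for the fildes == newfildes case. The helper names are hypothetical and this is not taken from any particular libc; it only shows that plain dup2() leaves FD_CLOEXEC alone while an explicit fcntl() clears it in place.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if clear, -1 on error. */
static int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : (flags & FD_CLOEXEC) != 0;
}

/* In-place removal of FD_CLOEXEC: no temporary descriptor is needed,
 * so there is no EMFILE or ENFILE risk. */
static int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Returns (flag after dup2) * 10 + (flag after clearing), or -1. */
int demo_dup2_same_fd(void)
{
    int fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
    int after_dup2, after_clear;

    if (fd < 0)
        return -1;
    if (dup2(fd, fd) != fd) {          /* required to be a no-op */
        close(fd);
        return -1;
    }
    after_dup2 = has_cloexec(fd);      /* flag is still set */
    if (clear_cloexec(fd) != 0) {
        close(fd);
        return -1;
    }
    after_clear = has_cloexec(fd);     /* flag is now clear */
    close(fd);
    return after_dup2 * 10 + after_clear;
}
```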

There is no need for posix_spawn_file_actions_adddup3( ), since it
makes no sense to create a file descriptor with FD_CLOEXEC set before
spawning the child process, where that file descriptor would
im

Re: [RFC/PATCH glibc 0/2] setting working dir in posix_spawn()

2018-09-07 Thread Eric Blake

[reviving a REALLY old thread]
https://sourceware.org/ml/libc-alpha/2010-08/msg00107.html

On 08/27/2010 01:35 AM, Jonathan Nieder wrote:

(pruned cc's, +cc:libc-alpha)

Eric Blake wrote:

On 08/26/2010 12:18 AM, Jonathan Nieder wrote:



Do you think there would be any interest in a posix_spawn() variant
that takes a dir parameter?  I am imagining something like this:


Of your variants, I would most prefer:


  int posix_spawn_file_actions_addchdir(posix_spawn_file_actions_t
*file_actions, int dirfd);


Today, I just submitted http://austingroupbugs.net/view.php?id=1208, 
then in searching my mail archives, I found this related thread that 
never had a response at the time, so I'm now offering a reply.  Compared 
to my thoughts 8 years ago, my new writeup proposed


int posix_spawn_file_actions_addchdir(posix_spawn_file_actions_t 
*restrict file_actions, const char *restrict name);
int posix_spawn_file_actions_addfchdir(posix_spawn_file_actions_t 
*file_actions, int dirfd);


which is slightly different from your RFC based on my older thoughts. 
But in re-reading your email, I see that we could indeed get by with 
JUST the fchdir() signature, since chdir("foo") can generally be 
decomposed into fchdir(open("foo", O_RDONLY|O_DIRECTORY)).




Okay, here's a proof of concept (for the easy case --- a fork()-
based implementation for Linux).  Patches apply to 8b2b771^.


For that matter, it may also be worth adding
posix_spawn_file_actions_addopenat, which mirrors the recent
addition of openat() semantics.


Sounds like a good idea.  I did not try it because I did not want to
think about whether it would cause the __spawn_action struct to grow
(and if so, what ramifications that would have, if any).


Also, I've realized that we do NOT need 
posix_spawn_file_actions_addopenat(). The main benefit of openat() is 
that you can redirect relative file names according to an fd of your 
choice, without affecting global state. But during posix_spawn(), there 
are no other threads competing for global state (if you are doing a 
library implementation where the chdir() is done between fork() and 
exec()), so:


openat(mydir, "file", mode);

can be decomposed to:

posix_spawn_file_actions_addopen(, 5, ".", O_RDONLY|O_DIRECTORY, 0);
posix_spawn_file_actions_addfchdir(, mydir);
posix_spawn_file_actions_addopen(, 4, "file", mode, 0);
posix_spawn_file_actions_addfchdir(, 5);
posix_spawn_file_actions_addclose(, 5);

We don't need to add posix_spawn_file_actions_addFOO for every possible 
FOO that typically gets called between fork/exec, as long as we can 
string together enough bare components to get feature parity within 
the single-threaded context of posix_spawn() for what is otherwise 
expensive if done in the parent as a wrapper around posix_spawn(), even 
if it adds more verbosity to the posix_spawn_* calls required to get 
the same desired effects.
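
For a library implementation that runs the file actions between fork() and exec(), the decomposition above boils down to the following sequence of ordinary calls. This is a sketch under assumptions: open_via_fchdir() and demo_open_via_fchdir() are hypothetical names, the descriptors are allocated dynamically instead of using the fixed numbers 5 and 4, /tmp is assumed to be writable, and the trick is only safe where no other thread depends on the working directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Emulate openat(dirfd, name, flags) using only open(), fchdir() and
 * close().  Callers must not pass O_CREAT, since no mode is supplied. */
static int open_via_fchdir(int dirfd, const char *name, int flags)
{
    int saved, fd = -1;

    saved = open(".", O_RDONLY | O_DIRECTORY);   /* remember the cwd */
    if (saved < 0)
        return -1;
    if (fchdir(dirfd) == 0) {
        fd = open(name, flags);                  /* relative to dirfd */
        if (fchdir(saved) != 0 && fd >= 0) {     /* restore the cwd */
            close(fd);
            fd = -1;
        }
    }
    close(saved);
    return fd;
}

/* Check that the emulation and openat() reach the same file.
 * Returns 0 on success, -1 otherwise. */
int demo_open_via_fchdir(void)
{
    char dir[] = "/tmp/spawn-demo-XXXXXX";
    struct stat a, b;
    int dirfd, fd1, fd2, same;

    if (mkdtemp(dir) == NULL)
        return -1;
    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) {
        rmdir(dir);
        return -1;
    }
    fd1 = openat(dirfd, "file", O_WRONLY | O_CREAT, 0600);
    fd2 = open_via_fchdir(dirfd, "file", O_RDONLY);
    same = fd1 >= 0 && fd2 >= 0 &&
           fstat(fd1, &a) == 0 && fstat(fd2, &b) == 0 &&
           a.st_dev == b.st_dev && a.st_ino == b.st_ino;

    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    unlinkat(dirfd, "file", 0);
    close(dirfd);
    rmdir(dir);
    return same ? 0 : -1;
}
```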


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Coordination on standardizing gettext() in future POSIX

2018-08-09 Thread Eric Blake

Hello GNU gettext folks,

Jörg Schilling is interested in standardizing gettext() and friends in a 
future version of POSIX (as a replacement to the hard-to-use catgets() 
that is currently standardized).  See 
http://austingroupbugs.net/view.php?id=1122


While there are probably things in GNU gettext that won't be 
standardized (for example, xgettext(1) has some long-only options, but 
POSIX will only standardize short options), it is worth coordinating the 
bare minimum set of features that are portable across GNU and other 
implementations of gettext, as well as any wording changes that need to 
be added (such as documenting thread-safety, locale interactions, 
whether bindtextdomain() can only safely be used once prior to creating 
threads, and so on) in order to actually be included in the standards.


Thus, this email is more of an introduction to make sure everyone 
interested in the project is aware of where to write/review any wording 
proposals for accomplishing the addition into POSIX.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: D1095R0/N2xxx draft 4: Zero overhead deterministic failure - A unified mechanism for C and C++

2018-08-09 Thread Eric Blake

On 08/08/2018 07:19 PM, Eric Blake wrote:

We've just had a discussion on whether standard-compliant abs() (which 
is currently undefined on INT_MIN) should be permitted and/or required 
to have well-defined behavior


I failed to provide a summary of my thoughts:

I think your paper's example should NOT use abs(), but instead some 
other function (whether you merely rename your existing example to 
'myabs', or pick a different function which DOES have well-defined errno 
semantics right now), precisely because abs() does NOT currently have 
well-defined errno semantics and it is controversial whether such 
semantics should be given to it.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: D1095R0/N2xxx draft 4: Zero overhead deterministic failure - A unified mechanism for C and C++

2018-08-09 Thread Eric Blake

On 08/08/2018 05:24 PM, Niall Douglas wrote:


https://docs.google.com/viewer?a=v=forums=MTEwODAzNzI2MjM1OTc0MjE3MjkBMDIyMjg0NDY2NTc4NzYyMDQzODYBX1RlYjRCNjREQUFKATAuMQFpc29jcHAub3JnAXYy=0




Comments are welcome, particularly on how best to offer POSIX functions
in a form both binary compatible with old code, and which calls the
_Fails(errno) form in newly compiled code.


An initial comment in regards to the example on page 5:

int abs(int x)
{
  if(x == INT_MIN)
  {
    errno = ERANGE;
    return 0;
  }
  return (x < 0) ? -x : x;
}

We've just had a discussion on whether standard-compliant abs() (which 
is currently undefined on INT_MIN) should be permitted and/or required 
to have well-defined behavior (either in the one direction of returning 
INT_MIN, as that is the fewest assembly instructions on typical 
hardware, or in the direction of adding errno handling, as you have done 
here).  The verdict is not final (I wish I could point you to mailing 
list archives, but https://www.opengroup.org/austin/mailarchives/ points 
to gmane, which is no longer functional, and I don't know of any other 
web archive covering the Austin list).  But so far, a rough consensus 
from the discussion on bug http://austingroupbugs.net/view.php?id=1108 
and http://austingroupbugs.net/view.php?id=1197 is that integral 
functions, like abs(), should NOT signal a range error for performance 
reasons (or that setting errno to ERANGE should be a feature of 
floating-point math, not integer math), and that the wording in 1108 
will once again be relaxed to leave the behavior of abs(INT_MIN) undefined, 
rather than well-defined (any specific implementation can, of course, 
define behavior as an extension to POSIX).
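
One way to keep the example's shape without touching the standard function is to give it a private name. The sketch below uses the hypothetical name myabs; the ERANGE convention is the paper's choice for illustration, not something C or POSIX requires of any integer function.

```c
#include <errno.h>
#include <limits.h>

/* Like abs(), but with a defined result for INT_MIN: the negation
 * -INT_MIN is not representable in int on two's-complement systems,
 * so that input is rejected before it is ever negated. */
int myabs(int x)
{
    if (x == INT_MIN) {
        errno = ERANGE;   /* report the unrepresentable result */
        return 0;
    }
    return (x < 0) ? -x : x;
}
```

Because the name is not reserved by the standard library, the function is free to define whatever error reporting it likes.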


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: sed -e 'a\' -e text

2018-08-07 Thread Eric Blake

On 08/07/2018 10:20 AM, Shware Systems wrote:

That is a bug in those shells, conformance wise. No buts. Consideration of 
quoting happens after line joining, for all forms, as noted in the sections on 
quoting and tokenization. I don't think that even qualifies as a permitted 
extension, which is partly why the $'...' form is being added; it has the \n 
escape.


Huh? The standard is clear that the following three sequences are 
identical in producing a literal backslash followed by a newline:


"\\
"
'\
'
$'\\\n'

and all shells are compliant with that. Backslash-newline line joining 
does NOT happen inside single quoting, but only inside double-quoting 
and in unquoted text.  Or more precisely, newline joining occurs when 
backslash is not quoted; when neither single- nor double-quoting is 
active, only backslash escaping can quote the backslash; when 
double-quoting is active, backslash is an escape character and 
behaves as unquoted unless backslash-escaped; but when single-quoting is 
active backslash is NOT an escape character and thus always behaves as 
quoted (cannot behave as unquoted, and therefore does not need escaping).




You need:
  -e a\bar
to disable the join now, so the character before the NL isn't a '\', for a 
conforming script. Then concatenation as part of quote removal keeps the '\' 
and NL.


You are correct that such a script results in the same input to sed, but 
the original example using just -e a\ is identical in 
behavior. Please quit spreading misinformation.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2008)/Issue 7 0000262]: sed with multiple -e options

2018-08-07 Thread Eric Blake

On 08/07/2018 10:18 AM, Joerg Schilling wrote:

Stephane Chazelas  wrote:


2018-08-07 15:46:33 +0100, Stephane Chazelas:
[...]

- or a variant thereof that covers historical implementations,
   that is same as above except that a fragment can't end in a
   backslash.

[...]

Correction: can't end in *an unescaped* backslash.

sed -e 'w file\\' -e q

(write to a file called file\) should still be OK.


OK, but please note that strings that end in an unescaped backslash are
unspecified at shell level already, so it is unspecified how the sed command is
called (unless you use execl()).


Huh? In shell, '\' with single-quoting is well-specified as a single 
backslash ('\\' is well-specified as two backslashes). And there is no 
way to write a double-quoted string that ends in an unescaped backslash, 
since both "\\" and "\"" are well-defined.  (It is possible to write a 
construct that results in an unterminated string starting with a double 
quote, such as:

eval echo '"'
but such constructs are already problematic for not having a terminating 
", rather than for ending in an unescaped backslash.)


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: perror() changes the orientation of stderr to byte-oriented mode if stderr is not oriented yet.

2018-07-02 Thread Eric Blake

On 06/29/2018 03:45 AM, Geoff Clare wrote:

Eric Blake  wrote, on 28 Jun 2018:


I'm forwarding an email originally sent to the Cygwin list. What do others
think? Are there enough grounds in the argument below that the CX-shading in
POSIX is too strict compared to existing implementations, and that I ought
to open a bug to change the wording on the requirements of perror() vs.
stderr orientation?


This issue arose in 2005 when C99 TC2 added perror() to the list of
byte input/output functions and created a conflict with POSIX.  The
end result was that C99 TC3 removed it so that POSIX would not need
to change.

See http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_322.htm



Although not in the C standard, we should also make sure that psignal() 
and psiginfo() have the same treatment as whatever we decide for 
perror(), since all three share the wording about "shall not change the 
orientation of the standard error stream".


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: perror() changes the orientation of stderr to byte-oriented mode if stderr is not oriented yet.

2018-06-28 Thread Eric Blake
I'm forwarding an email originally sent to the Cygwin list. What do 
others think? Are there enough grounds in the argument below that the 
CX-shading in POSIX is too strict compared to existing implementations, 
and that I ought to open a bug to change the wording on the requirements 
of perror() vs. stderr orientation?


On 06/28/2018 11:28 AM, Craig Howland wrote:

On 06/27/2018 08:55 AM, Corinna Vinschen wrote:

...


On Jun 27 20:01, Takashi Yano wrote:

POSIX states:
The perror() function shall not change the orientation of the standard
error stream.

However, cygwin perror() function changes the orientation of stderr to
byte-oriented mode if stderr is not oriented yet.
I suggest that POSIX is in error.  The POSIX statement about not 
changing the orientation is an extension to the C standard (CX, to be 
precise).  POSIX is always careful to defer to the C standard, which I 
think does indirectly specify that perror() is byte-oriented.  The C 
standard actually does not directly talk about the orientation of 
perror().  However, it directly defines (quoting from the N1570 C11 draft):


"The input/output functions are given the following collective terms:
— The wide character input functions — those functions described in 7.29 
that perform input into wide characters and wide strings: fgetwc, 
fgetws, getwc, getwchar, fwscanf, wscanf, vfwscanf, and vwscanf.
— The wide character output functions — those functions described in 
7.29 that perform output from wide characters and wide strings: fputwc, 
fputws, putwc, putwchar, fwprintf, wprintf, vfwprintf, and vwprintf.
— The wide character input/output functions — the union of the ungetwc 
function, the wide character input functions, and the wide character 
output functions.
— The byte input/output functions — those functions described in this 
subclause that perform input/output: fgetc, fgets, fprintf, fputc, 
fputs, fread, fscanf, fwrite, getc, getchar, printf, putc, putchar, 
puts, scanf, ungetc, vfprintf, vfscanf, vprintf, and vscanf."


Please note that perror() is not listed.  While this could be 
interpreted to mean it can be both, the proper way for that to have been 
done would be for it to appear in both lists--which it does not.  
However, perror() is defined in the same stdio.h subclause (i.e. 7.21) 
as all of the byte functions, as opposed to the wide-character functions in 
wchar.h (7.29).  So even though the C standard is sloppy and does not 
directly have perror() in the enumerated list, it is included by the 
general statement about the subclause. However, you could argue that it 
was purposely left out, which is why they bothered to list the others. 
Against this are the definition of perror() itself, and that they 
really should have listed perror() as an exception if it was so 
intended, and (as already mentioned) perror() should be in both lists if 
it is to be dual-oriented.


Here is the argument based on the perror() definition:

"void perror(const char *s);
...
It writes a sequence of characters to the standard error stream thus: 
first (if s is not a null pointer and the character pointed to by s is 
not the null character), the string pointed to by s followed by a colon 
(:) and a space; then an appropriate error message string followed by a 
new-line character."


Things to note:
1)  It is a regular character pointer, not a wide character pointer.  
Those characters, if supplied, are written.  (It says nothing about 
converting them to wide if need be, it says "the string pointed to by s".)
2)  "error message string".  It does not say 'or wide-character error 
message string if needed'.
3)  Followed by a "new-line character".  It does not say "new-line wide 
character", which is used throughout the wchar.h section (7.29).


So there is definitely a weakness in the C standard, but I think it is 
clear that perror() is a byte output function.  If the user wants to 
print to a wide-character stream, the only pure way to do it would be to 
turn strerror() (used by perror()) output into a wide-character string.  
POSIX noted this weakness, but fixed it with a bad extension, rather 
than classifying perror() as byte--which it clearly is.


Therefore, the newlib perror() behavior is correct and should not be 
changed. It definitely is a mess and there really ought to be a 
perrorw() function.
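
Which behavior a given C library actually has can be checked with fwide(), which only queries the orientation when its mode argument is 0. The probe below is a sketch with a hypothetical function name; its result is implementation-dependent, which is the point of the disagreement, so no particular outcome should be read as the correct one.

```c
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if perror() gave the previously unoriented stderr an
 * orientation, 0 if stderr is still unoriented afterwards, and -1 if
 * stderr already had an orientation before the call. */
int perror_changes_orientation(void)
{
    int before = fwide(stderr, 0);   /* mode 0: query only */
    int after;

    if (before != 0)
        return -1;
    errno = ENOENT;
    perror("probe");                 /* writes a message to stderr */
    after = fwide(stderr, 0);        /* <0 byte, >0 wide, 0 none */
    return after != 0;
}
```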




--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: can [[:digit:]] match something other than 0123456789?

2018-05-18 Thread Eric Blake

On 05/18/2018 12:24 PM, Wheeler, David A wrote:

This conversation seems strange; many locales use digits other than 0-9 to 
represent numbers.

The Eastern Arabic, Perso-Arabic variant, and Urdu variant all have digits, 
they just aren't 0-9.  In Unicode/ISO-10646 in particular there are the digits 
U+0660 through U+0669 and U+06F0 through U+06F9.  When I visited Saudi Arabia I 
saw the Eastern Arabic digits everywhere, not just 0-9.  For more:
https://en.wikipedia.org/wiki/Eastern_Arabic_numerals

Here's an example, U+0662:
http://www.fileformat.info/info/unicode/char/0662/index.htm
This is a decimal digit with value 2.  Java agrees.

It sounds like there are different use cases.  Maybe there needs to be a standard way to represent different 
cases, e.g., "exactly 0-9", "a digit in the current locale", and "a member of 
Unicode Character Category 'Number, Decimal Digit'".  I don't know if there's a need to distinguish the 
second and third cases.  It seems to me that [[:digit:]] should mean the second or third case.


The problem is that the  definition of isdigit() means only the 
first case (exactly the locale-independent 10 digits in the portable 
file name character set, whether locales are based on ASCII or EBCDIC), 
and the definition of [[:FOO:]] defers to  isFOO() where 
possible.  Yes, it may be nice to have additional classification 
routines, but as has been pointed out elsewhere in this thread, doing it 
solely by one character at a time may not be sufficient to capture all 
Unicode rules compared to what people really want to search for (for 
example, when searching for a character with an accent, you want to be 
able to find both the composed character, and the sequence of a plain 
character plus combining mark character, that both represent the same 
concept, but an iswFOO() test does not work on the latter example, since 
it occupies more than one character).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Eric Blake

On 05/15/2018 03:43 PM, Stephane Chazelas wrote:


Does that mean that [0-9] is also guaranteed to match on
0123456789 only? And that then [[:digit:]] in regexp/fnmatch is
close to useless as it's longer than [0-9]


Yes, I think that's a fair conclusion for the C locale, by virtue of the 
fact that the standard requires the encoding for 0-9 to be contiguous 
and in order.



and is a bit
misleading as it suggests it would be affected by localisation
(like the other character classes) while it's not.


It's still useful in non-C locales within regexp, since ALL uses of - 
for ranges within [] have unspecified (or was it implementation-defined) 
semantics outside of the C locale.  Using a named class guarantees 
the desired semantics of exactly 10 characters, rather than depending on 
whether the range operator behaves as desired in all locales rather 
than just the C locale.
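
As a small illustration of the named class, the sketch below checks a string with both the regex class and isdigit(). The function names are hypothetical, and the program is assumed to run in the default "C" locale since it never calls setlocale().

```c
#include <ctype.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s is a non-empty run of characters matching
 * [[:digit:]], 0 if not, -1 if the pattern fails to compile. */
int all_digits_regex(const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, "^[[:digit:]]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* The same test written with isdigit(), one character at a time. */
int all_digits_ctype(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}
```

Writing the class by name avoids relying on how a range expression such as 0-9 is interpreted by the locale's collation rules.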


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Eric Blake

On 05/15/2018 12:50 PM, Stephane Chazelas wrote:

You're a bit late to the party on this question :)


   digit
   Define the characters to be classified as numeric digits.

   In the POSIX locale, only:

0 1 2 3 4 5 6 7 8 9


Please read http://austingroupbugs.net/view.php?id=1078 where this 
wording has been tightened to cover ALL locales, not just the POSIX 
locale, to better match with C requirements on isdigit().


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: Laundry list

2018-04-27 Thread Eric Blake
On 04/27/2018 12:10 PM, Martijn Dekker wrote:

> 
> I don't know of any way to accomplish that except by the de-facto
> standard mechanism of "#! /usr/bin/env sh". There is a long-time and
> highly widespread expectation that this will work.
> 
>>   In addition to shell
>> scripts, the shebang hack is also commonly used with awk and sed
>> scripts (just to name two other POSIX-specified languages).
> 
> IMO, that's another good reason to standardise the hashbang path plus
> the location of /usr/bin/env.

If we standardize #! and the existence of /usr/bin/env, we should also
consider standardizing the BSD invention of 'env -S' that GNU coreutils
is now copying, as it serves as a very nice workaround for passing
multiple arguments to the real interpreter through the #! line even when
the OS passes only a single argument to env (as the #! interpreter).

https://www.freebsd.org/cgi/man.cgi?query=env
https://lists.gnu.org/archive/html/coreutils/2018-04/msg00011.html

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [1003.1(2008)/Issue 7 0001064]: basename() and dirname(): Specification is not complete enough to allow existing thread-unsafe implementations

2017-12-15 Thread Eric Blake
On 12/14/2017 09:47 PM, Robert Elz wrote:
> One final question about the intent for basename() / dirname() ...
> 
> With the current (Issue7.TC2) wording of how these functions are defined,
> it is clear that the sequence
> 
>   bn1 = basename(buf1);
>   bn2 = basename(buf2);
> 
> leaves bn1 undefined (the value may have been modified.)
> 
> The proposed new wording does not allow that any more.

Well, it does for at least the corner case of an implementation that
returns "." and "/" by modifying a single internal buffer, rather than
by pointing to two separate const strings.  But yes, having bn1 persist
even after the second basename() within the same thread matches existing
behavior of all implementations that modified the caller's input.

>   I understand the
> intent is to require that implementations become thread safe, but that
> could be achieved using thread local storage for a static buffer to
> hold the result - but if the above is required to work (that is, the
> results of basename() and dirname() are not permitted to be overwritten
> by a subsequent call in the same thread) then that will not work, and the
> only implementation technique possible (that I can think of anyway) will
> be buffer modification.

Thread local storage for a static buffer may not be large enough to
support all possible inputs, and the goal was that the function cannot
fail.  Thus, in-buffer modification is the only viable solution for
large inputs.
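
For application code that has to work with both old and new wording, the defensive pattern is to hand basename() a scratch copy and to copy the result out before anything else can touch it. The helpers below are a sketch with hypothetical names (basename_dup, basename_is), not an interface proposed in the bug.

```c
#define _POSIX_C_SOURCE 200809L
#include <libgen.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc()ed copy of basename(path), or NULL on allocation
 * failure.  The caller owns the result and must free() it. */
char *basename_dup(const char *path)
{
    char *scratch = strdup(path);   /* basename() may modify its argument */
    char *result;

    if (scratch == NULL)
        return NULL;
    /* The returned pointer may point into scratch or into storage owned
     * by the implementation, so copy it before scratch is released. */
    result = strdup(basename(scratch));
    free(scratch);
    return result;
}

/* Convenience check: 1 if basename(path) equals expected, else 0. */
int basename_is(const char *path, const char *expected)
{
    char *b = basename_dup(path);
    int same = b != NULL && strcmp(b, expected) == 0;

    free(b);
    return same;
}
```

With this wrapper, two results can be held at once regardless of whether the implementation modifies the input or returns a pointer to internal storage.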

> 
> That's OK with me (some NetBSD developers are less than thrilled about
> all of this...) if that is the intent, I just wanted to make sure that
> it was understood what the effects of this change are, compared with
> what was there before.
> 
> For what it is worth, if I had to guess, I'd say that the likely NetBSD
> outcome of all of this is that we deprecate the 2 functions - mark them
> as not to be used any more (though they'd still be supported in an
> Issue 8 compatible way for compliance) and switch everything in the
> NetBSD src tree to use a new interface (perhaps a slightly reworked
> version of Ed Schouten's proposal in his note added to this issue in
> the middle of last year (2016)).

That approach matches what the GNU folks have already done years ago,
when gnulib introduced their own base_name() and dir_name() functions
with different (but reliable) semantics, eschewing the use of basename()
and dirname() in GNU code.  (Another aspect of the GNU code is that on
DOS-like systems, base_name() handles drive letters, which is something
that basename() completely ignores because POSIX does not have the
notion of drive letters.)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: Should "exec" run a shell function?

2017-07-18 Thread Eric Blake
On 07/18/2017 04:48 AM, Geoff Clare wrote:
> 
> On page 2398 line 76737 section 2.14 exec, add to EXAMPLES:
> 
> Execute the implementation's printf utility, ensuring that any
> shell built-in version is not executed instead, and using a subshell
> so that the shell continues afterwards:
> 
> (exec printf '%g\n' "$float_value")

The standard does not require printf %g to work; can we use a better
example?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


