Ingo Schwarze wrote on Fri, Nov 17, 2017 at 03:07:48PM +0100:
[ regarding cases where this may matter in practice ]
> (2) Programs legitimately calling *printf() with a variable format
> string in any non-POSIX locale, even if it's just UTF-8.
Whoa. I just realized there is a very widespread subclass of this:
Programs using gettext(3) for printf(3) format strings.
They are easy to grep for with /printf\(_/.
Here are a few examples:
aspell-0.60.6.1/prog/aspell.cpp: setlocale (LC_ALL, "");
aspell-0.60.6.1/prog/aspell.cpp: printf(_("\n %s filter: %s\n"), ...
and many other instances, but i don't see any return value checks
avahi-0.7/avahi-utils/avahi-browse.c: setlocale(LC_ALL, "");
avahi-0.7/avahi-utils/avahi-browse.c: printf(_(": Cache exhausted\n"));
and some others, again no return value checks
e2fsprogs-1.42.12/resize/main.c: setlocale(LC_CTYPE, "");
e2fsprogs-1.42.12/resize/online.c: printf(_("Filesystem at %s is mounted...
and hundreds of others, no return value checks
geeqie-1.3/src/main.c: setlocale(LC_ALL, "");
geeqie-1.3/src/main.c: log_printf(_("Creating %s dir:%s\n"), ...
dozens of them, no return value checks in sight
git-2.14.1/gettext.c: setlocale(LC_CTYPE, "");
git-2.14.1/builtin/merge.c: printf(_("Wonderful.\n"));
hundreds of them, hard to say whether there are any return value
checks, quite possibly not
gnupg-2.1.23/common/i18n.c: setlocale (LC_ALL, "" );
gnupg-2.1.23/g10/keygen.c: tty_printf(_("Invalid selection.\n"));
dozens of them, no return value checks in sight
There are also many ports using g_strdup_printf(3) in this way, no
idea what that does, but it seems likely to call *printf(3) internally
in some way.
The show goes on (without checking for setlocale(3) to save time):
gnutls, libiconv, libv4l, mutt, openjp2, openldap, postgresql, vlc,
xz, ...
These are just some ports that i happened to build from source
recently. So basically, almost *everybody* is using this, but
hardly anybody ever checks for success or failure.
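For illustration only, a return value check could look roughly like
the following. This is a minimal sketch: checked_fprintf() is a
hypothetical helper invented here, not something any of the ports
listed above provide, and writing a message to stderr is just one
possible way to react to a failure.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper, not taken from any of the ports above:
 * format to fp and report failure instead of silently ignoring it.
 * Returns the byte count from vfprintf(3), or -1 on error.
 */
int
checked_fprintf(FILE *fp, const char *fmt, ...)
{
	va_list	 ap;
	int	 rc;

	va_start(ap, fmt);
	rc = vfprintf(fp, fmt, ap);
	va_end(ap);

	if (rc < 0) {
		/* Output or encoding error: say so on stderr. */
		fputs("checked_fprintf: formatting failed\n", stderr);
		return -1;
	}
	return rc;
}
```

A caller would then write something like
checked_fprintf(stdout, _("Wonderful.\n")) and test the result for -1
instead of discarding it.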
When the return value is not checked, the change still makes the
following difference: Without the change, an invalidly encoded
format string prints nothing. With the change, an invalidly encoded
format string prints invalidly encoded output. The former may
sometimes be safer, but the missing information might sometimes
lead to trouble.
That said, i just checked what glibc and commercial Solaris 11 do,
and lo and behold:
schwarze@donnerwolke:~/Test/printf$ uname -a
Linux donnerwolke.asta-kit.de 4.9.0-0.bpo.3-686 #1 SMP \
Debian 4.9.30-2+deb9u2~bpo8+1 (2017-06-27) i686 GNU/Linux
schwarze@donnerwolke:~/Test/printf$ cat printf.c
#include <err.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
	int irc;

	if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
		errx(1, "setlocale");
	irc = printf("start\xc3\xa9middle\xc3%s\n", "end");
	printf("%d\n", irc);
	return 0;
}
schwarze@donnerwolke:~/Test/printf$ make printf
cc printf.c -o printf
schwarze@donnerwolke:~/Test/printf$ ./printf > tmp.txt ; echo $?
0
schwarze@donnerwolke:~/Test/printf$ hexdump -C tmp.txt
00000000  73 74 61 72 74 c3 a9 6d  69 64 64 6c 65 c3 65 6e  |start..middle.en|
00000010  64 0a 31 38 0a                                    |d.18.|
00000015
Same thing on Solaris.
Judging from a superficial look at the FreeBSD and NetBSD sources,
they don't appear to validate the format either.
So even if my reading of the standard should be correct (which some
here have challenged), given that everybody else here intuitively
expected the function to behave differently, that the stuff is
actually used a lot in practice, and given that most (if not all)
other implementations appear to behave the way that people intuitively
expect, i think i should stand down and no longer object to the
change. Maybe i should consider a small clarification in the manual
page afterwards, not yet sure whether that is needed.
I don't think, though, that the commit message should advertise
this as a performance improvement. It should be called an intentional
change of behaviour, now using the format string as a byte string
like everyone else, no matter whether POSIX explicitly specifies
it as a character string instead.
Yours,
Ingo