Ingo Schwarze wrote on Fri, Nov 17, 2017 at 03:07:48PM +0100:

[ regarding cases where this may matter in practice ]
>  (2) Programs legitimately calling *printf() with a variable format
>      string in any non-POSIX locale, even if it's just UTF-8.

Whoa.  I just realized there is a very widespread subclass of this:
Programs using gettext(3) for printf(3) format strings.
They are easy to grep for with /printf\(_/.

Here are a few examples:

  aspell-0.60.6.1/prog/aspell.cpp: setlocale (LC_ALL, "");
  aspell-0.60.6.1/prog/aspell.cpp: printf(_("\n  %s filter: %s\n"), ...
    and many other instances, but i don't see any return value checks

  avahi-0.7/avahi-utils/avahi-browse.c: setlocale(LC_ALL, "");
  avahi-0.7/avahi-utils/avahi-browse.c: printf(_(": Cache exhausted\n"));
    and some others, again no return value checks

  e2fsprogs-1.42.12/resize/main.c: setlocale(LC_CTYPE, "");
  e2fsprogs-1.42.12/resize/online.c: printf(_("Filesystem at %s is mounted...
    and hundreds of others, no return value checks

  geeqie-1.3/src/main.c: setlocale(LC_ALL, "");
  geeqie-1.3/src/main.c: log_printf(_("Creating %s dir:%s\n"), ...
    dozens of them, no return value checks in sight

  git-2.14.1/gettext.c: setlocale(LC_CTYPE, "");
  git-2.14.1/builtin/merge.c: printf(_("Wonderful.\n"));
    hundreds of them, hard to say whether there are any return value
    checks, quite possibly not

  gnupg-2.1.23/common/i18n.c: setlocale (LC_ALL, "" );
  gnupg-2.1.23/g10/keygen.c: tty_printf(_("Invalid selection.\n"));
    dozens of them, no return value checks in sight

There are also many ports using g_strdup_printf(3) in this way, no
idea what that does, but is seems likely to call *printf(3) internally
in some way.

The show goes on (without checking for setlocale(3) to save time):
gnutls, libiconv, libv4l, mutt, openjp2, openldap, postgresql, vlc,
xz, ...

These are just some ports that i happened to build from source
recently.  So basically, almost *everybody* is using this, but
hardly anybody ever checks for success or failure.

When the return value is not checked, the change still makes the
following difference: Without the change, an invalidly encoded
format string prints nothing.  With the change, an invalidly encoded
format string prints invalidly encoded output.   The former may
sometimes be safer, but the missing information might sometimes
lead to trouble.


That said, i just checked what glibc and commercial Solaris 11 do,
and lo and behold:

  schwarze@donnerwolke:~/Test/printf$ uname -a
  Linux donnerwolke.asta-kit.de 4.9.0-0.bpo.3-686 #1 SMP \
    Debian 4.9.30-2+deb9u2~bpo8+1 (2017-06-27) i686 GNU/Linux

  schwarze@donnerwolke:~/Test/printf$ cat printf.c
  #include <err.h>
  #include <locale.h>
  #include <stdio.h>
  
  int
  main(void)
  {
        int irc;

        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
                errx(1, "setlocale");
        irc = printf("start\xc3\xa9middle\xc3%s\n", "end");
        printf("%d\n", irc);
        return 0;
  }

  schwarze@donnerwolke:~/Test/printf$ make printf
  cc     printf.c   -o printf

  schwarze@donnerwolke:~/Test/printf$ ./printf > tmp.txt ; echo $?
  0

  schwarze@donnerwolke:~/Test/printf$ hexdump -C tmp.txt
  00000000  73 74 61 72 74 c3 a9 6d  69 64 64 6c 65 c3 65 6e |start..middle.en|
  00000010  64 0a 31 38 0a                                   |d.18.|
  00000015

Same thing on Solaris.

Judging from a superficial look at the FreeBSD and NetBSD sources,
they don't appear to validate the format either.

So even if my reading of the standard should be correct (which some
here have challenged), given that everybody else here intuitively
expected the function to behave differently, that the stuff is
actually used a lot in practice, and given that most (if not all)
other implementations appear to behave the way that people intuitively
expect, i think i should stand down and no longer object to the
change.  Maybe i should consider a small clarification in the manual
page afterwards, not yet sure whether that is needed.

I don't think, though, that the commit message should advertise
this as a performance improvement.  It should be called an intentional
change of behaviour, now using the format string as a byte string
like everyone else, no matter whether POSIX explicitly specifies
it as a character string instead.

Yours,
  Ingo

Reply via email to