printf(1): support \u and \U

Martijn van Duren Sun, 18 Apr 2021 08:53:19 -0700

I'm always frustrated when a unicode character question comes up and I
have to look up the UTF-8 byte sequence to reproduce it. When fixing \x
I found the \u and \U escape sequences in gprintf, which seem mighty
handy for this exact case.


My implementation differs from gprintf in that leading zeroes can be
omitted, but I kept both \u and \U, for compatibility and for cases like
\u00ebb, where I don't want to add 6 zeroes just to get my desired
unicode character in front of an isxdigit(3) character.
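
To make the digit limit concrete, here is a minimal sketch of the
parsing rule described above. parse_hex_escape() and its parameters are
hypothetical names for illustration, not code from the diff: it reads at
most maxdigits hexadecimal digits and stops early at the first
non-hexadecimal character.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the diff: read at most maxdigits
 * hexadecimal digits from s, store their value in *out and return the
 * number of digits consumed.
 */
static size_t
parse_hex_escape(const char *s, int maxdigits, unsigned long *out)
{
	size_t n = 0;
	unsigned long v = 0;

	while (maxdigits-- > 0 && isxdigit((unsigned char)s[n])) {
		int ch = tolower((unsigned char)s[n]);

		/* Shift in one hexadecimal digit at a time. */
		v = v * 16 + (isdigit(ch) ? ch - '0' : ch - 'a' + 10);
		n++;
	}
	*out = v;
	return n;
}
```

With maxdigits set to 4 (the \u case), "00ebb" yields 0xeb and leaves
the final "b" as a literal character, while "ebb" is read in full as
0xebb. With maxdigits set to 8 (the \U case), the same split needs
"000000ebb".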

gprintf talks about "Unicode (ISO/IEC 10646)" in their manpage for the
\u case and just Unicode for the \U case. I read that glibc uses 10646
internally for wchar_t, but I have no idea how 10646 might differ from
true Unicode for values from 0 through 0xffff, so I stuck with just the
term unicode in the manpage part.

gprintf prints the \u or \U form for characters > 0x7f and < 0x100 in
the C locale, where this diff currently outputs these byte values.
My previous diff[0] should fix this.
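
For reference, the escaped form itself can be sketched as below. This
only illustrates the formatting (4 hexadecimal digits after \u for
values up to 0xffff, 8 after \U otherwise), not the locale check that
decides when it is used. format_escape() is a hypothetical helper, not
code from the diff.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of the diff: write the escaped form of
 * code point cp into buf and return the length snprintf(3) reports.
 */
static int
format_escape(char *buf, size_t len, unsigned long cp)
{
	if (cp > 0xffff)
		return snprintf(buf, len, "\\U%08lX", cp);
	return snprintf(buf, len, "\\u%04lX", cp);
}
```

For 0xeb this produces \u00EB, and for 0x1f600 it produces \U0001F600.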

OK after unlock?

martijn@

[0] https://marc.info/?l=openbsd-tech&m=161875718324367&w=2

Index: printf.1
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.1,v
retrieving revision 1.34
diff -u -p -r1.34 printf.1
--- printf.1    16 Jan 2020 16:46:47 -0000      1.34
+++ printf.1    18 Apr 2021 15:46:44 -0000
@@ -103,6 +103,14 @@ Write a backslash character.
 Write an 8-bit character whose ASCII value is
 the 1-, 2-, or 3-digit octal number
 .Ar num .
+.It Cm \eu Ns Ar num
+Write a unicode character whose value is
+the 1-, 2-, 3-, or 4-digit hexadecimal number
+.Ar num .
+.It Cm \eU Ns Ar num
+Write a unicode character whose value is
+the 1-, 2-, 3-, 4-, 5-, 6-, 7-, or 8-digit hexadecimal number
+.Ar num .
 .El
 .Pp
 Each format specification is introduced by the percent
@@ -356,6 +364,19 @@ no argument is used.
 In no case does a non-existent or small field width cause truncation of
 a field; padding takes place only if the specified field width exceeds
 the actual width.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character encoding
+.Xr locale 1 .
+It decides which unicode values can be output in the current character encoding.
+If a character can't be displayed in the current locale it falls back to the
+shortest full
+.Cm \eu Ns Ar num
+or
+.Cm \eU Ns Ar num
+presentation.
+.El
 .Sh EXIT STATUS
 .Ex -std printf
 .Sh EXAMPLES
@@ -383,7 +404,9 @@ and always operates as if
 were set.
 .Pp
 The escape sequences
-.Cm \ee
+.Cm \ee ,
+.Cm \eu ,
+.Cm \eU
 and
 .Cm \e' ,
 as well as omitting the leading digit
Index: printf.c
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.c,v
retrieving revision 1.26
diff -u -p -r1.26 printf.c
--- printf.c    18 Nov 2016 15:53:16 -0000      1.26
+++ printf.c    18 Apr 2021 15:46:44 -0000
@@ -33,10 +33,12 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>
 
 static int      print_escape_str(const char *);
 static int      print_escape(const char *);
@@ -79,6 +81,8 @@ main(int argc, char *argv[])
        char convch, nextch;
        char *format;
 
+       setlocale(LC_CTYPE, "");
+
        if (pledge("stdio", NULL) == -1)
                err(1, "pledge");
 
@@ -275,8 +279,10 @@ static int
 print_escape(const char *str)
 {
        const char *start = str;
+       char mbc[MB_LEN_MAX + 1];
+       wchar_t wc = 0;
        int value;
-       int c;
+       int c, i;
 
        str++;
 
@@ -348,6 +354,25 @@ print_escape(const char *str)
        case 't':                       /* tab */
                putchar('\t');
                break;
+
+       case 'U':
+       case 'u':
+               c = *str == 'U' ? 8 : 4;
+               str++;
+               for (; c-- && isxdigit((unsigned char)*str); str++) {
+                       wc <<= 4;
+                       wc += hextobin(*str);
+               }
+               if ((c = wctomb(mbc, wc)) == -1) {
+                       printf("\\%c%0*X", wc > 0xffff ? 'U' : 'u',
+                           wc > 0xffff ? 8 : 4, wc);
+                       wc = L'\0';
+                       wctomb(NULL, wc);
+               } else {
+                       for (i = 0; i < c; i++)
+                               putchar(mbc[i]);
+               }
+               return str - start - 1;
 
        case 'v':                       /* vertical-tab */
                putchar('\v');
