I'm always frustrated when a unicode character question comes up and I have to look up the UTF-8 byte sequence to reproduce it. When fixing \x I found the \u and \U escape sequences in gprintf, which seem mighty handy for this exact case.
My implementation differs from gprintf in that leading zeroes can be omitted, but I kept both \u and \U for compatibility and for cases like \u00ebb, where I don't want to add 6 zeroes just to get my desired unicode character in front of an isxdigit(3) character.

gprintf talks about "Unicode (ISO/IEC 10646)" in its manpage for the \u case and just Unicode for the \U case. I read that glibc uses 10646 internally for wchar_t, but I have no idea how 10646 might differ from true unicode for values from 0 through 0xffff, so I stuck with just the term unicode in the manpage part.

gprintf prints the \u or \U form for characters > 0x7f and < 0x100 in the C locale, where this diff currently outputs these byte values. My previous diff[0] should fix this.

OK after unlock?

martijn@

[0] https://marc.info/?l=openbsd-tech&m=161875718324367&w=2

Index: printf.1
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.1,v
retrieving revision 1.34
diff -u -p -r1.34 printf.1
--- printf.1	16 Jan 2020 16:46:47 -0000	1.34
+++ printf.1	18 Apr 2021 15:46:44 -0000
@@ -103,6 +103,14 @@ Write a backslash character.
 Write an 8-bit character whose ASCII value is
 the 1-, 2-, or 3-digit octal number
 .Ar num .
+.It Cm \eu Ns Ar num
+Write a unicode character whose value is
+the 1-, 2-, 3-, or 4-digit hexadecimal number
+.Ar num .
+.It Cm \eU Ns Ar num
+Write a unicode character whose value is
+the 1-, 2-, 3-, 4-, 5-, 6-, 7-, or 8-digit hexadecimal number
+.Ar num .
 .El
 .Pp
 Each format specification is introduced by the percent
@@ -356,6 +364,19 @@ no argument is used.
 In no case does a non-existent or small field width cause truncation of
 a field; padding takes place only if the specified field width exceeds
 the actual width.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character encoding
+.Xr locale 1 .
+It decides which unicode values can be output in the current character encoding.
+If a character can't be displayed in the current locale it falls back to the
+shortest full
+.Cm \eu Ns Ar num
+or
+.Cm \eU Ns Ar num
+presentation.
+.El
 .Sh EXIT STATUS
 .Ex -std printf
 .Sh EXAMPLES
@@ -383,7 +404,9 @@ and always operates as if
 were set.
 .Pp
 The escape sequences
-.Cm \ee
+.Cm \ee ,
+.Cm \eu ,
+.Cm \eU
 and
 .Cm \e' ,
 as well as omitting the leading digit
Index: printf.c
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.c,v
retrieving revision 1.26
diff -u -p -r1.26 printf.c
--- printf.c	18 Nov 2016 15:53:16 -0000	1.26
+++ printf.c	18 Apr 2021 15:46:44 -0000
@@ -33,10 +33,12 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>
 
 static int	 print_escape_str(const char *);
 static int	 print_escape(const char *);
@@ -79,6 +81,8 @@ main(int argc, char *argv[])
 	char convch, nextch;
 	char *format;
 
+	setlocale(LC_CTYPE, "");
+
 	if (pledge("stdio", NULL) == -1)
 		err(1, "pledge");
 
@@ -275,8 +279,10 @@ static int
 print_escape(const char *str)
 {
 	const char *start = str;
+	char mbc[MB_LEN_MAX + 1];
+	wchar_t wc = 0;
 	int value;
-	int c;
+	int c, i;
 
 	str++;
 
@@ -348,6 +354,25 @@ print_escape(const char *str)
 	case 't':	/* tab */
 		putchar('\t');
 		break;
+
+	case 'U':
+	case 'u':
+		c = *str == 'U' ? 8 : 4;
+		str++;
+		for (; c-- && isxdigit((unsigned char)*str); str++) {
+			wc <<= 4;
+			wc += hextobin(*str);
+		}
+		if ((c = wctomb(mbc, wc)) == -1) {
+			printf("\\%c%0*X", wc > 0xffff ? 'U' : 'u',
+			    wc > 0xffff ? 8 : 4, wc);
+			wc = L'\0';
+			wctomb(NULL, wc);
+		} else {
+			for (i = 0; i < c; i++)
+				putchar(mbc[i]);
+		}
+		return str - start - 1;
 
 	case 'v':	/* vertical-tab */
 		putchar('\v');
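
For anyone who wants to poke at the digit handling without building printf(1), here is a rough standalone sketch of the two pieces that don't depend on the locale: reading at most 4 (\u) or 8 (\U) hex digits, and producing the shortest full fallback form. The names hexval, parse_uescape and format_fallback are made up for this illustration and are not part of the diff; the wctomb(3) step is deliberately left out because its result depends on LC_CTYPE.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: value of one hex digit (caller has checked isxdigit). */
static unsigned int
hexval(char ch)
{
	if (ch >= '0' && ch <= '9')
		return ch - '0';
	return tolower((unsigned char)ch) - 'a' + 10;
}

/*
 * s points at the 'u' or 'U' that follows the backslash.
 * Reads at most 4 (\u) or 8 (\U) hex digits into *cp; leading zeroes
 * may be omitted. Returns the number of characters consumed,
 * including the 'u' or 'U' itself.
 */
static size_t
parse_uescape(const char *s, unsigned long *cp)
{
	const char *p = s;
	int max;

	assert(*s == 'u' || *s == 'U');
	max = (*p == 'U') ? 8 : 4;
	*cp = 0;
	p++;
	for (; max-- > 0 && isxdigit((unsigned char)*p); p++)
		*cp = (*cp << 4) | hexval(*p);
	return (size_t)(p - s);
}

/*
 * Fallback text for a value the current locale cannot represent:
 * \uXXXX when it fits in 4 hex digits, \UXXXXXXXX otherwise.
 * Returns what snprintf(3) returns.
 */
static int
format_fallback(char *buf, size_t len, unsigned long cp)
{
	if (cp > 0xffff)
		return snprintf(buf, len, "\\U%08lX", cp);
	return snprintf(buf, len, "\\u%04lX", cp);
}
```

With this, parse_uescape("u00ebb", &cp) stops after four digits and leaves the trailing 'b' alone, which is the case from the mail above; parse_uescape("Uebx", &cp) yields the same value from just two digits.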
