Re: locale in locate(1)
Hi, Jan Stary wrote on Thu, Jan 10, 2019 at 09:35:48PM +0100: > Does locate(1) need to setlocale(3)? Committed as part of a much larger diff. Note that answering that question required substantial code review and code cleanup. When auditing, don't just mechanically look for suspicious patterns. You have to actually inspect and understand the code to make an an audit useful. Something like the patch below is merely a starting point for an audit (trivial to find), not a result. Yours, Ingo > Index: locate/locate.c > === > RCS file: /cvs/src/usr.bin/locate/locate/locate.c,v > retrieving revision 1.31 > diff -u -p -r1.31 locate.c > --- locate/locate.c 19 Nov 2015 21:46:05 - 1.31 > +++ locate/locate.c 10 Jan 2019 20:34:16 - > @@ -73,7 +73,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -122,7 +121,6 @@ main(int argc, char *argv[]) > { > int ch; > char **dbv = NULL; > - (void) setlocale(LC_ALL, ""); > > if (pledge("stdio rpath", NULL) == -1) > err(1, "pledge");
Re: locale in locate(1)
Hi Ted, Ted Unangst wrote on Sun, Jan 13, 2019 at 03:22:41AM -0500: > Ingo Schwarze wrote: >> So, here is some cleanup: >> * garbage collect useless hand-rolled lookup tables >> * merge one-line helper functions into callers >> * garbage collect obfuscating macros >> * fix one format string error (%d used for u_int) >> * garbage collect setlocale(3) and >> * garbage collect several unused constants >> * minus 45 LOC, no functional change >> >> OK? > I think you may be fixing a bug here... Is that an OK? >> Index: locate/fastfind.c >> === >> RCS file: /cvs/src/usr.bin/locate/locate/fastfind.c,v >> retrieving revision 1.14 >> diff -u -p -r1.14 fastfind.c >> --- locate/fastfind.c8 Dec 2017 17:26:42 - 1.14 >> +++ locate/fastfind.c13 Jan 2019 02:07:47 - >> @@ -126,10 +126,8 @@ fastfind_mmap >> u_char bigram1[NBG], bigram2[NBG], path[PATH_MAX]; >> >> #ifdef FF_ICASE >> -/* use a lookup table for case insensitive search */ >> -u_char table[UCHAR_MAX + 1]; >> - >> -tolower_word(pathpart); >> +for (p = pathpart; *p != '\0'; p++) >> +*p = tolower(*p); >> #endif /* FF_ICASE*/ >> >> /* init bigram table */ >> @@ -156,11 +154,8 @@ fastfind_mmap >> p = pathpart; >> patend = patprep(p); >> cc = *patend; >> - >> #ifdef FF_ICASE >> -/* set patend char to true */ >> -table[TOLOWER(*patend)] = 1; >> -table[toupper(*patend)] = 1; >> +cc = tolower(cc); >> #endif /* FF_ICASE */ > I cannot see where that table is zeroed. > It lives on the stack, it could contain anything. Good point. Then again, if that bug stings, it only causes bogus matches in the pre-match phase (and only with -i). The real comparison is done later on the complete strings, see the calls to fnmatch(3) and strstr(3) bwlow. So the bug probably only ruined -i performance, but did not lead to invalid results. Yours, Ingo
Re: locale in locate(1)
Ingo Schwarze wrote: > So, here is some cleanup: > * garbage collect useless hand-rolled lookup tables > * merge one-line helper functions into callers > * garbage collect obfuscating macros > * fix one format string error (%d used for u_int) > * garbage collect setlocale(3) and > * garbage collect several unused constants > * minus 45 LOC, no functional change > > OK? > Ingo I think you may be fixing a bug here... > Index: locate/fastfind.c > === > RCS file: /cvs/src/usr.bin/locate/locate/fastfind.c,v > retrieving revision 1.14 > diff -u -p -r1.14 fastfind.c > --- locate/fastfind.c 8 Dec 2017 17:26:42 - 1.14 > +++ locate/fastfind.c 13 Jan 2019 02:07:47 - > @@ -126,10 +126,8 @@ fastfind_mmap > u_char bigram1[NBG], bigram2[NBG], path[PATH_MAX]; > > #ifdef FF_ICASE > - /* use a lookup table for case insensitive search */ > - u_char table[UCHAR_MAX + 1]; > - > - tolower_word(pathpart); > + for (p = pathpart; *p != '\0'; p++) > + *p = tolower(*p); > #endif /* FF_ICASE*/ > > /* init bigram table */ > @@ -156,11 +154,8 @@ fastfind_mmap > p = pathpart; > patend = patprep(p); > cc = *patend; > - > #ifdef FF_ICASE > - /* set patend char to true */ > - table[TOLOWER(*patend)] = 1; > - table[toupper(*patend)] = 1; > + cc = tolower(cc); > #endif /* FF_ICASE */ I cannot see where that table is zeroed. It lives on the stack, it could contain anything.
Re: locale in locate(1)
Hi Scott, Scott Cheloha wrote on Thu, Jan 10, 2019 at 07:18:43PM -0600: > locate(1) uses tolower(3), the results of which can be affected by > locale settings. In general, yes. On OpenBSD, not really: $ man tolower | grep -A 1 always OpenBSD always uses the C locale for these functions, ignoring the global locale, the thread-specific locale, and the locale argument. That makes sense because: $ man tolower | grep -A 1 representable The argument c must be EOF or representable as an unsigned char; otherwise, the result is undefined. So locale(1) support for tolower(3) really only makes sense for single-byte non-ASCII locales (like LATIN-1) and we don't support any of those. The wchar_t characters needed for UTF-8 cannot be represented as unsigned chars. > locate(1)'s database is, at least in theory, portable. I guess you are trying to say the right thing... More precisely: Filenames are byte strings, *NOT* character strings, not even multibyte character strings. Consequently, a filename does *NOT* have a character encoding. The locate datebase simply stores these byte strings. That it uses its own, home-grown encoding scheme does not matter because that is not a *character* encoding but rather some very naive compression method. Since it does *NOT* store characters, the locate database format is locale-independent in the first place. > If you don't explicitly set the locale to POSIX/C I think it might > be possible to create a corrupted database, No. The database creation tools - updatedb, mklocatedb, bigram, code - all explicitly ignore the locale. There is no way to create a corrupt database. You will have fun copying a database from one machine to another machine where sizeof(int) is different, though. But that's not corrupt, just incompatible, and unrelated to character encoding. > or read a valid database as corrupted. No, that isn't possible either. Even if the tolower(3) function were locale-dependent on OpenBSD, the worst thing that could happen would be partially broken support for the locale(1) -i option. But it isn't, so whether you run locate(1) with the ASCII or with the UTF-8 locale makes no difference. That said, something else *can* go wrong. If filenames contain terminal control codes, running locate(1) can screw up your terminal: as far as i can see, it doesn't have the protections that ls(1) has, but that's a separate matter. A brief look into /usr/src/usr.bin/locale/ makes it obvious that it's a rathole of messy code requiring some love gently applied with Bob's whale flensing knife. For example, the multiple layers of wrappers around tolower(3) are totally pointless because native tolower(3) already does table lookup in the first place. So, here is some cleanup: * garbage collect useless hand-rolled lookup tables * merge one-line helper functions into callers * garbage collect obfuscating macros * fix one format string error (%d used for u_int) * garbage collect setlocale(3) and * garbage collect several unused constants * minus 45 LOC, no functional change OK? Ingo Index: locate/fastfind.c === RCS file: /cvs/src/usr.bin/locate/locate/fastfind.c,v retrieving revision 1.14 diff -u -p -r1.14 fastfind.c --- locate/fastfind.c 8 Dec 2017 17:26:42 - 1.14 +++ locate/fastfind.c 13 Jan 2019 02:07:47 - @@ -126,10 +126,8 @@ fastfind_mmap u_char bigram1[NBG], bigram2[NBG], path[PATH_MAX]; #ifdef FF_ICASE - /* use a lookup table for case insensitive search */ - u_char table[UCHAR_MAX + 1]; - - tolower_word(pathpart); + for (p = pathpart; *p != '\0'; p++) + *p = tolower(*p); #endif /* FF_ICASE*/ /* init bigram table */ @@ -156,11 +154,8 @@ fastfind_mmap p = pathpart; patend = patprep(p); cc = *patend; - #ifdef FF_ICASE - /* set patend char to true */ - table[TOLOWER(*patend)] = 1; - table[toupper(*patend)] = 1; + cc = tolower(cc); #endif /* FF_ICASE */ @@ -173,10 +168,11 @@ fastfind_mmap /* go forward or backward */ if (c == SWITCH) { /* big step, an integer */ - if (len < INTSIZE) + if (len < sizeof(int)) break; count += getwm(paddr) - OFFSET; - len -= INTSIZE; paddr += INTSIZE; + len -= sizeof(int); + paddr += sizeof(int); } else { /* slow step, =< 14 chars */ count += c - OFFSET; } @@ -207,7 +203,7 @@ fastfind_mmap break; /* SWITCH */ } #ifdef FF_ICASE - if (table[c]) + if (tolower(c) == cc) #else if
Re: locale in locate(1)
Scott Cheloha wrote: > > On Jan 10, 2019, at 14:35, Jan Stary wrote: > > > > Does locate(1) need to setlocale(3)? > > locate(1) uses tolower(3), the results of which can be affected by > locale settings. locate(1)'s database is, at least in theory, portable. > If you don't explicitly set the locale to POSIX/C I think it might be > possible to create a corrupted database, or read a valid database > as corrupted. Speaking of portability... man locate says: The locate database is not byte order independent. It is not possible to share the databases between machines with different byte order. The current locate implementation understands databases in host byte order or network byte order. So a little-endian machine can't understand a locate database which was built on a big-endian machine. That doesn't really make sense. If it understands network byte order, then it can read big endian databases. And then there's this comment in the source: * Convert network byte order to host byte order if necessary. * So we can read on FreeBSD/i386 (little endian) a locate database * which was built on SunOS/sparc (big endian). So there's a little mystery to unravel for somebody with multiple CPUs. Does it work or not?
Re: locale in locate(1)
> Does locate(1) need to setlocale(3)? Since you ask the question, yes. Oh wait, was your question rhetorical? It lacks facts. Who's time are you wasting? > Index: locate/locate.c > === > RCS file: /cvs/src/usr.bin/locate/locate/locate.c,v > retrieving revision 1.31 > diff -u -p -r1.31 locate.c > --- locate/locate.c 19 Nov 2015 21:46:05 - 1.31 > +++ locate/locate.c 10 Jan 2019 20:34:16 - > @@ -73,7 +73,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -122,7 +121,6 @@ main(int argc, char *argv[]) > { > int ch; > char **dbv = NULL; > - (void) setlocale(LC_ALL, ""); > > if (pledge("stdio rpath", NULL) == -1) > err(1, "pledge"); >
Re: locale in locate(1)
Scott Cheloha wrote: > The ctype wizard may also have a very elegant explanation clarifying > whether this is correct or not. The diffs received are annoying. What problem does it solve? What are the downsides? Is there any education in the diffs as to why the changes are needed, so that people not freshly in the game can participate? No. Questions are not asked nor answered. Like, seriously come on. If you don't know why you are writing a diff or can't be bothered to articulate why you are writing the diff why are you sending it? I suggest we all sit this out until there is another try.
Re: locale in locate(1)
> On Jan 10, 2019, at 14:35, Jan Stary wrote: > > Does locate(1) need to setlocale(3)? locate(1) uses tolower(3), the results of which can be affected by locale settings. locate(1)'s database is, at least in theory, portable. If you don't explicitly set the locale to POSIX/C I think it might be possible to create a corrupted database, or read a valid database as corrupted. But tedu says we have a policy so maybe I am mistaken about this. The ctype wizard may also have a very elegant explanation clarifying whether this is correct or not.
locale in locate(1)
Does locate(1) need to setlocale(3)? Jan Index: locate/locate.c === RCS file: /cvs/src/usr.bin/locate/locate/locate.c,v retrieving revision 1.31 diff -u -p -r1.31 locate.c --- locate/locate.c 19 Nov 2015 21:46:05 - 1.31 +++ locate/locate.c 10 Jan 2019 20:34:16 - @@ -73,7 +73,6 @@ #include #include #include -#include #include #include #include @@ -122,7 +121,6 @@ main(int argc, char *argv[]) { int ch; char **dbv = NULL; - (void) setlocale(LC_ALL, ""); if (pledge("stdio rpath", NULL) == -1) err(1, "pledge");