2015-10-23 23:00 GMT+02:00 Ingo Schwarze <schwa...@usta.de>: > Hi Ted, > > Ted Unangst wrote on Fri, Oct 23, 2015 at 09:38:22AM -0400: > >> so that works with the diff below. > > I agree with the direction for this kind of tool, at least for now. > However, your diff has a few issues, so i improved it, see below. > > Any OKs or vetos? > > Ted, in case you want to commit, the version below is obviously > OK schwarze@. > >> i'm not sure how far down this road we need >> to travel, but i figure it's worth a little exploration. > > I think making any valid sequence of single-codepoint characters > work is reasonable, in particular if it just takes 15 lines > of additional code in a utility of 500 lines. > > > Changes with respect to tedu@'s version: > > * chunk 151 and chunk 158: unchanged > * chunk 211: new chunk > Required for the sequence underscore, backspace, multibyte character: > Mark all the bytes underlined, not just the first one, or the > multibyte character will be broken. > * chunk 237 part 1: new change > Required such that bytes with the high bit set compare equal > even on signed char architectures. > * chunk 237 part 2: style tweak > Actually use the shiny new isu8cont() function, > do not inline a copy of its code. > > > Aspects not solved and other comments: > > - The new code runs always. > In a POSIX locale, text files are not supposed to contain bytes > with the high bit set, so it is undefined in the first place > what ul(1) should do. Of course, we could artificially add yet > more code (heavy-weight code with setlocale(3) and nl_langinfo(3), > actually) to gratuitiously mess the file up, but i consider it > more useful to treat UTF-8 gracefully even when the locale is > not set, such that ul(1) output is predictable independently of > the user's locale.
Well, we could create another helper, like 'isutf8on()', and just call it when needed. So the actual logic won't be hurt too much... I won't insist hard on that, though. > - character, backspace, different character > This is not valid backspace encoding for bold or italic, > so ul(1) is not supposed to handle it. But at least, it > no longer produces invalid UTF-8 even in that case. > - The FreeBSD change with wchar_t (+70 -44 lines) seems > like overkill to me. > - Nothing changes with respect to tabs. > To ul(1), tabs just mean "add enough blanks to advance to the > next character position that is a multiple of eight". A backspace > will then remove the last one of them. The usefulness of this > feature may be argued, but that's unrelated to UTF-8. > > > Index: ul.c > =================================================================== > RCS file: /cvs/src/usr.bin/ul/ul.c,v > retrieving revision 1.19 > diff -u -p -r1.19 ul.c > --- ul.c 10 Oct 2015 16:15:03 -0000 1.19 > +++ ul.c 23 Oct 2015 20:19:17 -0000 > @@ -151,6 +151,12 @@ main(int argc, char *argv[]) > exit(0); > } > > +int > +isu8cont(unsigned char c) > +{ > + return (c & (0x80 | 0x40)) == 0x80; > +} > + > void > mfilter(FILE *f) > { > @@ -158,8 +164,11 @@ mfilter(FILE *f) > > while ((c = getc(f)) != EOF && col < MAXBUF) switch(c) { > case '\b': > - if (col > 0) > + while (col > 0) { > col--; > + if (!isu8cont(obuf[col].c_char)) > + break; > + } > continue; > case '\t': > col = (col+8) & ~07; > @@ -211,9 +220,13 @@ mfilter(FILE *f) > continue; > > case '_': > - if (obuf[col].c_char) > + if (obuf[col].c_char != '\0') { > obuf[col].c_mode |= UNDERL | mode; > - else > + if (obuf[col].c_char & 0x80) > + while (col < maxcol & > + isu8cont(obuf[col+1].c_char)) > + obuf[++col].c_mode |= UNDERL | mode; > + } else > obuf[col].c_char = '_'; Shouldn't the last part be something like that instead? else { while (col < maxcol & isu8cont(obuf[col+1].c_char)) obuf[++col].c_mode |= UNDERL | mode; obuf[col].c_char = '_'; } I could be easy wrong, still trying to understand the intention behind the original code... > /* FALLTHROUGH */ > case ' ': > @@ -237,10 +250,12 @@ mfilter(FILE *f) > } else if (obuf[col].c_char == '_') { > obuf[col].c_char = c; > obuf[col].c_mode |= UNDERL|mode; > - } else if (obuf[col].c_char == c) > + } else if (obuf[col].c_char == (char)c) Did you actuall wanted "(unsigned char)c" here? > obuf[col].c_mode |= BOLD|mode; > else > obuf[col].c_mode = mode; > + if (col > 0 && isu8cont(c)) > + obuf[col].c_mode = obuf[col - 1].c_mode; Is this right? Shouldn't it just ignore c_mode changing in isu8cont(c)==1 case, and on isu8cont(c)==0 case go back through all isu8cont()==1 characters and set their mode based on non-cont character following them? > col++; > if (col > maxcol) > maxcol = col; > -- WBR, Vadim Zhukov