2015-10-23 23:00 GMT+02:00 Ingo Schwarze <schwa...@usta.de>:
> Hi Ted,
>
> Ted Unangst wrote on Fri, Oct 23, 2015 at 09:38:22AM -0400:
>
>> so that works with the diff below.
>
> I agree with the direction for this kind of tool, at least for now.
> However, your diff has a few issues, so i improved it, see below.
>
> Any OKs or vetos?
>
> Ted, in case you want to commit, the version below is obviously
> OK schwarze@.
>
>> i'm not sure how far down this road we need
>> to travel, but i figure it's worth a little exploration.
>
> I think making any valid sequence of single-codepoint characters
> work is reasonable, in particular if it just takes 15 lines
> of additional code in a utility of 500 lines.
>
>
> Changes with respect to tedu@'s version:
>
>  * chunk 151 and chunk 158: unchanged
>  * chunk 211: new chunk
>    Required for the sequence underscore, backspace, multibyte character:
>    Mark all the bytes underlined, not just the first one, or the
>    multibyte character will be broken.
>  * chunk 237 part 1: new change
>    Required such that bytes with the high bit set compare equal
>    even on signed char architectures.
>  * chunk 237 part 2: style tweak
>    Actually use the shiny new isu8cont() function,
>    do not inline a copy of its code.
>
>
> Aspects not solved and other comments:
>
>  - The new code runs always.
>    In a POSIX locale, text files are not supposed to contain bytes
>    with the high bit set, so it is undefined in the first place
>    what ul(1) should do.  Of course, we could artificially add yet
>    more code (heavy-weight code with setlocale(3) and nl_langinfo(3),
>    actually) to gratuitiously mess the file up, but i consider it
>    more useful to treat UTF-8 gracefully even when the locale is
>    not set, such that ul(1) output is predictable independently of
>    the user's locale.

Well, we could create another helper, like 'isutf8on()', and just call
it when needed. So the actual logic won't be hurt too much... I won't
insist hard on that, though.

>  - character, backspace, different character
>    This is not valid backspace encoding for bold or italic,
>    so ul(1) is not supposed to handle it.  But at least, it
>    no longer produces invalid UTF-8 even in that case.
>  - The FreeBSD change with wchar_t (+70 -44 lines) seems
>    like overkill to me.
>  - Nothing changes with respect to tabs.
>    To ul(1), tabs just mean "add enough blanks to advance to the
>    next character position that is a multiple of eight".  A backspace
>    will then remove the last one of them.  The usefulness of this
>    feature may be argued, but that's unrelated to UTF-8.
>
>
> Index: ul.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/ul/ul.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 ul.c
> --- ul.c        10 Oct 2015 16:15:03 -0000      1.19
> +++ ul.c        23 Oct 2015 20:19:17 -0000
> @@ -151,6 +151,12 @@ main(int argc, char *argv[])
>         exit(0);
>  }
>
> +int
> +isu8cont(unsigned char c)
> +{
> +       return (c & (0x80 | 0x40)) == 0x80;
> +}
> +
>  void
>  mfilter(FILE *f)
>  {
> @@ -158,8 +164,11 @@ mfilter(FILE *f)
>
>         while ((c = getc(f)) != EOF && col < MAXBUF) switch(c) {
>         case '\b':
> -               if (col > 0)
> +               while (col > 0) {
>                         col--;
> +                       if (!isu8cont(obuf[col].c_char))
> +                               break;
> +               }
>                 continue;
>         case '\t':
>                 col = (col+8) & ~07;
> @@ -211,9 +220,13 @@ mfilter(FILE *f)
>                 continue;
>
>         case '_':
> -               if (obuf[col].c_char)
> +               if (obuf[col].c_char != '\0') {
>                         obuf[col].c_mode |= UNDERL | mode;
> -               else
> +                       if (obuf[col].c_char & 0x80)
> +                               while (col < maxcol &
> +                                   isu8cont(obuf[col+1].c_char))
> +                                       obuf[++col].c_mode |= UNDERL | mode;
> +               } else
>                         obuf[col].c_char = '_';

Shouldn't the last part be something like that instead?

else {
     while (col < maxcol & isu8cont(obuf[col+1].c_char))
        obuf[++col].c_mode |= UNDERL | mode;
    obuf[col].c_char = '_';
}

I could be easy wrong, still trying to understand the intention behind
the original code...

>                 /* FALLTHROUGH */
>         case ' ':
> @@ -237,10 +250,12 @@ mfilter(FILE *f)
>                 } else if (obuf[col].c_char == '_') {
>                         obuf[col].c_char = c;
>                         obuf[col].c_mode |= UNDERL|mode;
> -               } else if (obuf[col].c_char == c)
> +               } else if (obuf[col].c_char == (char)c)

Did you actuall wanted "(unsigned char)c" here?

>                         obuf[col].c_mode |= BOLD|mode;
>                 else
>                         obuf[col].c_mode = mode;
> +               if (col > 0 && isu8cont(c))
> +                       obuf[col].c_mode = obuf[col - 1].c_mode;

Is this right? Shouldn't it just ignore c_mode changing in
isu8cont(c)==1 case, and on isu8cont(c)==0 case go back through all
isu8cont()==1 characters and set their mode based on non-cont
character following them?

>                 col++;
>                 if (col > maxcol)
>                         maxcol = col;
>

--
  WBR,
  Vadim Zhukov

Reply via email to