[fltk.bugs] UTF-8 conversions [was: STR #2505: Xft backend doesn't filter bad UTF-8 characters]

Albrecht Schlosser Fri, 07 Jan 2011 11:00:09 -0800

On 07.01.2011 10:24, Manolo Gouy wrote:

[part 2 to this question:]
> Could Xft be processed under Cygwin as it is under X11 ?
>
> That is, if one does, in fl_utf.c:
> unsigned fl_utf8towc(const char* src, unsigned srclen,
>                 wchar_t* dst, unsigned dstlen)
> {
> #if defined(WIN32) && !defined(__CYGWIN__)
>    return fl_utf8toUtf16(src, srclen, (unsigned short*)dst, dstlen);
> #else
> ...
> #endif
> }
>
> and removes the cygwin-special cases in Fl_Xlib_Graphics_Driver::draw
> and utf8extents() of fl_font_xft.cxx,
> would that run OK on cygwin ?


No, that wouldn't work, because fl_utf8towc() would still write a
wchar_t array, and wchar_t is 2 bytes on Windows, Cygwin included.
I had to change it the other way around, and I was really surprised
when I looked closer at the code. The old code was:

#ifdef WIN32
   return fl_utf8toUtf16(src, srclen, (unsigned short*)dst, dstlen);
#else
...

My first thought was: ooh, that's *wrong*, it must be

#if defined(WIN32) || defined(__CYGWIN__)
...

But how could it have worked before, if fl_utf8towc() had really
returned an array with 4 bytes for each character (UCS-4)?

The answer is simple, though surprising at first. Despite the
comment ("except on win32...") the code _tried_ to convert to
UCS-4, but then assigned the resulting UCS-4 value to a wchar_t,
and thus truncated it to a 2-byte value:

  ... unsigned ucs = fl_utf8decode(p,e,&len);
  ... dst[count] = (wchar_t)ucs;

This works well as long as no surrogate pairs are needed
(ucs >= 0xd800 && ...), so it went undiscovered until now.
Thanks for the hint that made me look at it.
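
To illustrate the truncation, here is a minimal sketch (not the FLTK
code). The type wchar16 and the functions truncate_to_16() and
encode_utf16() are hypothetical names made up for this example;
wchar16 stands in for a 2-byte wchar_t so that the effect shows up on
any platform. It compares the plain cast with a proper UTF-16
encoding that emits a surrogate pair for code points above 0xFFFF:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 2-byte wchar_t (assumption for this sketch). */
typedef uint16_t wchar16;

/* What a plain cast to a 2-byte type does: only the low 16 bits survive. */
wchar16 truncate_to_16(unsigned ucs) {
    return (wchar16)ucs;
}

/* Encode one code point as UTF-16.
 * Writes 1 or 2 units to out and returns the number of units written. */
unsigned encode_utf16(unsigned ucs, wchar16 out[2]) {
    assert(ucs <= 0x10FFFF);               /* outside the Unicode range */
    assert(ucs < 0xD800 || ucs > 0xDFFF);  /* surrogates are not characters */
    if (ucs < 0x10000) {                   /* fits into a single unit */
        out[0] = (wchar16)ucs;
        return 1;
    }
    ucs -= 0x10000;                        /* 20 bits remain */
    out[0] = (wchar16)(0xD800 | (ucs >> 10));    /* high surrogate */
    out[1] = (wchar16)(0xDC00 | (ucs & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1F600 the cast yields 0xF600, which is a different character,
whereas encode_utf16() yields the pair 0xD83D 0xDE00. For anything
below 0x10000 (outside the surrogate range) both give the same single
unit, which is why the cast went unnoticed.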

I changed a few more occurrences of WIN32 to WIN32 || __CYGWIN__
(or the reverse logic), but I didn't touch anything beyond pure
UTF-8 string conversions.
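
As an aside, the property that really matters for these conversions is
the width of wchar_t, not the platform name. The following is only a
sketch of how one could test that directly with the standard WCHAR_MAX
macro; it is not what the FLTK sources do (they use the WIN32 and
__CYGWIN__ macros as described above), and WCHAR_IS_UTF16 and
wchar_is_utf16() are hypothetical names:

```c
#include <stdint.h>
#include <wchar.h>

/* WCHAR_MAX is usable in #if; a value that fits in 16 bits means
 * wchar_t cannot hold a full UCS-4 code point and needs UTF-16. */
#if WCHAR_MAX <= 0xFFFF
#  define WCHAR_IS_UTF16 1
#else
#  define WCHAR_IS_UTF16 0
#endif

/* Returns 1 if wide strings need UTF-16 (surrogate pairs), 0 otherwise. */
int wchar_is_utf16(void) {
    return WCHAR_IS_UTF16;
}
```

With a 2-byte wchar_t, as on Windows and Cygwin, this selects the
UTF-16 path; with a 4-byte wchar_t it selects UCS-4.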

I'm not sure what (if anything) to do with fl_utf8locale(),
because this is more a matter of OS handling. In this case
__CYGWIN__ means POSIX compatibility, so we should leave this
to the Cygwin/POSIX layer (hence only "#ifdef WIN32").

But then there are:

  - unsigned fl_utf8to_mb() and
  - unsigned fl_utf8from_mb()

According to the comments, they are used for filename conversions
for OS-specific functions (they are used in filename_list.cxx).
I decided not to touch them, because I don't know what would be
correct. I remember that we recently had a patch concerning file
name handling, so I hope that this is all okay.

Albrecht
_______________________________________________
fltk-bugs mailing list
fltk-bugs@easysw.com
http://lists.easysw.com/mailman/listinfo/fltk-bugs
