For the last two weeks I've been working on the glib unicode backend. There were two main obstacles to switching to glib unicode: the glib backend was incomplete/broken, and there were performance concerns (glib uses utf8 and webkit uses utf16, so we have to convert to utf8 to use glib and then convert the result back to utf16).

Note: Long mail, go to "Summarizing" if you are not interested in the details.

- Tests still failing with glib unicode:

fast/encoding/frame-default-enc.html

I'm not sure why this one fails.

fast/encoding/yentest2.html
fast/encoding/yentest.html

This is a mess. Character 0x5c, which is a backslash in ascii, is ambiguous in some Japanese encodings like Shift-JIS, where it can be either a yen sign (U+00A5) or a backslash (U+005C). There's already a workaround for this in webkit, but it doesn't seem to work for us, because ICU decodes 0x5c as U+00A5 and iconv decodes it as U+005C. We could just add a workaround to always map 0x5c to U+005C when the encoding is Shift-JIS, but I'm not sure that's correct, because I don't know whether ICU always does it or whether it depends on the current locale or something else. More info:

https://bugs.webkit.org/show_bug.cgi?id=24906
http://blogs.msdn.com/b/michkap/archive/2005/09/17/469941.aspx

fast/encoding/GBK/EUC-CN.html
fast/encoding/GBK/chinese.html
fast/encoding/GBK/cn-gb.html
fast/encoding/GBK/csgb2312.html
fast/encoding/GBK/csgb231280.html
fast/encoding/GBK/gb2312.html
fast/encoding/GBK/gb_2312-80.html
fast/encoding/GBK/gbk.html
fast/encoding/GBK/iso-ir-58.html
fast/encoding/GBK/x-euc-cn.html
fast/encoding/GBK/x-gbk.html
fast/encoding/hebrew/8859-8-e.html
fast/encoding/hebrew/8859-8-i.html
fast/encoding/hebrew/csISO88598I.html
fast/encoding/hebrew/logical.html

These either use encodings not supported by iconv or contain an invalid character that ICU replaces with another special one. They are skipped in qt.

fast/encoding/char-encoding-mac.html

Most of the encodings used in this test are not supported by iconv. Skipped in qt too.

fast/encoding/hebrew/8859-8-e.html
fast/encoding/hebrew/8859-8-i.html
fast/encoding/hebrew/csISO88598I.html
fast/encoding/hebrew/logical.html

Not supported by iconv either, skipped in qt too.

fast/js/sputnik/Unicode/Unicode_320/S15.5.4.16_A1.html
fast/js/sputnik/Unicode/Unicode_500/S15.5.4.16_A1.html
fast/js/sputnik/Unicode/Unicode_500/S15.5.4.18_A1.html
fast/js/sputnik/Unicode/Unicode_510/S15.5.4.16_A1.html
fast/js/sputnik/Unicode/Unicode_510/S15.5.4.18_A1.html

The problem is that g_unichar_tolower() only works for characters that are G_UNICODE_UPPERCASE_LETTER or G_UNICODE_TITLECASE_LETTER. These tests use 0x2160..0x216F (G_UNICODE_LETTER_NUMBER) and 0x24B6..0x24CF (G_UNICODE_OTHER_SYMBOL). I filed a bug in glib:

https://bugzilla.gnome.org/show_bug.cgi?id=633436

fast/text/find-kana.html
fast/text/find-russian.html
fast/text/find-soft-hyphen.html
fast/xsl/sort-unicode.xml

The problem here is the algorithm used when searching text in case-insensitive mode. We just use casefold() to convert the string into a form that is independent of case, which seems to be what firefox does too. But ICU implements the search algorithm of strength 3, which means that, for example, accented characters match their non-accented versions. More info:

http://www.unicode.org/reports/tr10/#Searching
http://userguide.icu-project.org/collation/icu-string-search-service
https://bugs.webkit.org/show_bug.cgi?id=48056

This is probably the most difficult bug to fix.

fast/dom/Range/range-expand.html

This one fails only for the Chinese words, due to this pango bug:

https://bugzilla.gnome.org/show_bug.cgi?id=97545

The bug has been open since 2002, so . . .

fast/url/host.html

This is a bug in glib:

https://bugzilla.gnome.org/show_bug.cgi?id=633350

- Performance improvements

JavaScriptCore/wtf/unicode/glib/UnicodeGLib.cpp

The problematic functions are the versions of foldCase(), toLower() and toUpper() that convert a string. The functions that convert a single character are not a problem, because we have a g_unichar_ function in glib for each of them, except for foldcase. When converting a string we need to convert between utf8 and utf16.

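To make the cost concrete, here is a rough sketch of the round trip that a string conversion implies. It is plain C with no glib dependency, so it is not the code in UnicodeGLib.cpp: the function names are hypothetical, the input is assumed to be well-formed, and an ASCII-only lowercasing loop stands in for the real utf8 case conversion (something like g_utf8_strdown()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: encode well-formed UTF-16 as UTF-8.
 * 'out' must have room for 3 bytes per input code unit.
 * Returns the number of bytes written. */
size_t utf16_to_utf8(const uint16_t *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = in[i];
        /* Combine a surrogate pair into a single code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len) {
            uint32_t low = in[++i];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        if (cp < 0x80) {
            out[n++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}

/* Hypothetical helper: decode well-formed UTF-8 back to UTF-16.
 * Returns the number of UTF-16 code units written. */
size_t utf8_to_utf16(const unsigned char *in, size_t len, uint16_t *out)
{
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = in[i++];
        uint32_t cp;
        int extra;
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        /* Fold in the continuation bytes, 6 bits each. */
        while (extra-- > 0 && i < len)
            cp = (cp << 6) | (uint32_t)(in[i++] & 0x3F);
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 + (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        } else {
            out[n++] = (uint16_t)cp;
        }
    }
    return n;
}

/* Lowercase a UTF-16 string by going through UTF-8 and back.
 * 'out' must have room for 'len' code units. */
size_t lower_utf16_via_utf8(const uint16_t *in, size_t len, uint16_t *out)
{
    unsigned char *tmp = malloc(3 * len + 1);   /* temporary utf8 copy */
    assert(tmp != NULL);
    size_t bytes = utf16_to_utf8(in, len, tmp); /* conversion 1 */
    /* ASCII-only stand-in for the real utf8 case conversion. */
    for (size_t i = 0; i < bytes; i++)
        if (tmp[i] >= 'A' && tmp[i] <= 'Z')
            tmp[i] = (unsigned char)(tmp[i] + ('a' - 'A'));
    size_t units = utf8_to_utf16(tmp, bytes, out); /* conversion 2 */
    free(tmp);
    return units;
}
```

Every call pays for a temporary allocation and two full passes over the string on top of the case conversion itself, which is the overhead in question here.
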
I haven't done any benchmarks, so I don't know the real impact of these conversions on performance. I've already proposed an improvement here:

https://bugs.webkit.org/show_bug.cgi?id=48625

UChar32 foldCase(UChar32 ch)

The problem here is that we don't have an equivalent function in glib, because the foldcase of some characters is represented by more than one character. ICU and qt have a foldCase() method for a single character that only works for characters that have a 1 to 1 mapping. I talked to behdad to see whether we could do the same in glib:

<KaL> behdad: I'm wondering why there isn't g_unichar_casefold, wouldn't it make sense even though it wonly works for single-character mapping?
<behdad> KaL: you know the answer already :)
<KaL> behdad: no, I don't
<KaL> :-P
<behdad> KaL: you said it. it's hack...
<behdad> slightly better than one that only works for ascii...
<KaL> yes well, what it's a hack is what we have to do in webkit to emulate it
<behdad> yes, unfortunately our unicode support is far from complete :(
<KaL> behdad: it's actually tolower + a few special cases
<behdad> oh, I see what you mean...
<behdad> KaL: well, that webkit has wrong design for unicode is not glib's problem

A workaround might be to copy (or generate our own) table of special case foldings.

- Summarizing:

Most of the test cases that are failing are corner cases or bugs in pango/glib. We would need to measure times to know whether performance is actually an important issue or not. So maybe we are not ready to switch to the glib unicode backend by default, but we can probably remove the message that says it's slow and incomplete. Or we could try to make it the default and see what happens.

Sorry for the long mail.

--
Carlos Garcia Campos
http://pgp.rediris.es:11371/pks/lookup?op=get&search=0xF3D322D0EC4582C3
_______________________________________________
webkit-gtk mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-gtk
