https://bugzilla.wikimedia.org/show_bug.cgi?id=8445
Andrew Dunbar <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #10 from Andrew Dunbar <[email protected]> 2009-06-24 05:29:36 UTC ---
"Other CJK languages are welcome to make similar fixes, I'll just concentrate
on Zh here."

Not all CJK languages omit interword spaces, and not all languages which omit
interword spaces are CJK:

* Korean does use spaces between words, quite possibly a full-width space
  character rather than ASCII 0x20.
* Thai and Khmer (Cambodian) do not use spaces between words.
* Note that both Unicode and HTML include means of indicating invisible word
  breaks for such languages. Then again, a quick Google search seems to
  indicate that the HTML "WBR" tag is neither official nor interpreted with
  the same semantics by everybody.

Another approach would be to harvest Han compounds from sources such as EDICT,
CEDICT, and the various Wiktionaries.

Google does morphological analysis to determine which strings of Han characters
are compounds that should be treated as words.

Andrew Dunbar (hippietrail)
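
A rough illustration of the dictionary approach mentioned above: a minimal
Python sketch that splits a run of Han characters by greedy longest match
against a list of known compounds, then joins the pieces with U+200B (ZERO
WIDTH SPACE) as the invisible word break. The word list here is a made-up toy
set, not real EDICT/CEDICT data, and the names segment and mark_breaks are
hypothetical. This is not how Google or MediaWiki does it, only one simple way
the idea could work.

```python
ZWSP = "\u200b"  # Unicode ZERO WIDTH SPACE, an invisible word-break hint

# Hypothetical toy word list. A real one might be harvested from
# CEDICT or a Wiktionary dump; nothing here comes from those sources.
COMPOUNDS = {"中文", "维基", "百科", "维基百科"}


def segment(text, words=COMPOUNDS):
    """Split text by greedy longest match against a set of known words.

    Characters that start no known word become single-character tokens.
    """
    max_len = max(map(len, words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in words:
                tokens.append(piece)
                i += n
                break
    return tokens


def mark_breaks(text, words=COMPOUNDS):
    """Return text with a zero-width space between the matched words."""
    return ZWSP.join(segment(text, words))
```

Greedy longest match is the simplest choice and will mis-split ambiguous
strings; proper morphological analysis would be needed to do better.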
