https://bugzilla.wikimedia.org/show_bug.cgi?id=8445


Andrew Dunbar <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]




--- Comment #10 from Andrew Dunbar <[email protected]>  2009-06-24 05:29:36 UTC ---
"Other CJK languages are welcome to make similar fixes, I'll just
concentrate on Zh here."

Not all CJK languages omit interword spaces and not all languages which omit
interword spaces are CJK:

* Korean does use spaces between words, though quite possibly a full-width
space character rather than ASCII 0x20.
* Thai and Khmer (Cambodian) do not use spaces between words.
* Note that both Unicode and HTML include means of indicating invisible word
breaks for such languages. Then again, a quick Google search suggests that the
HTML "WBR" tag is neither official nor interpreted with the same semantics by
everybody.
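
To illustrate the invisible-break point, here is a minimal sketch that splits
text on Unicode whitespace (which covers the full-width space U+3000) and on
ZERO WIDTH SPACE (U+200B). The function name split_on_breaks is hypothetical,
and this assumes the author has actually placed break characters in the text;
it does nothing for unmarked Thai, Khmer or Chinese runs.

```python
import re

# \s matches Unicode whitespace in Python 3 str patterns, including the
# ideographic (full-width) space U+3000. U+200B is a format character,
# not whitespace, so it has to be listed explicitly.
_BREAKS = re.compile(r"[\s\u200b]+")


def split_on_breaks(text):
    """Split text at visible spaces and at invisible U+200B word breaks."""
    return [token for token in _BREAKS.split(text) if token]
```

A search tokenizer could apply something like this first and fall back to
another segmentation method only for tokens that are still long unbroken runs.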

Another approach would be to harvest Han compounds from sources such as EDICT,
CEDICT, and the various Wiktionaries. Google does morphological analysis to
determine which strings of Han characters are compounds that should be treated
as words.
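
A harvested compound list could be used roughly as follows. This is only a
sketch of greedy longest-match segmentation, not what Google or MediaWiki
actually does; the function name segment, the max_len parameter and the tiny
lexicon in the usage note are all made up for illustration, and a real lexicon
would be loaded from one of the dictionaries mentioned above.

```python
def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a set of known compounds.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; if none matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

With a toy lexicon of {"中文", "维基", "百科", "维基百科"}, calling
segment("中文维基百科", lexicon) yields ["中文", "维基百科"]. Greedy matching is
simple but can pick the wrong split where compounds overlap, which is one
reason a proper morphological analysis does better.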

Andrew Dunbar (hippietrail)


