https://bugzilla.wikimedia.org/show_bug.cgi?id=48630
Web browser: ---
Bug ID: 48630
Summary: Data model needs characters, not code points
Product: VisualEditor
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: Unprioritized
Component: Data Model
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected],
[email protected]
Classification: Unclassified
Mobile Platform: ---
At present, the VisualEditor treats UTF-16 code units as if they were
synonymous with abstract characters. Here are two cases where this causes bugs:
1) UTF-16 uses a surrogate pair to represent each Unicode character above
U+FFFF. For instance, U+282E2 ('elevator' in Cantonese) is a single character
represented in JavaScript as "\uD860\uDEE2". In a plain textarea, this behaves
like a single character from the user's point of view. In the VisualEditor,
however, cursoring past it or backspacing over it takes two key presses, and
after cursoring once, any text typed goes into the middle of the surrogate
pair, creating invalid UTF-16 (see The Unicode Standard, Version 6.2,
Section 3.8, Surrogates).
2) Combining accents can be used in sequences to build up abstract characters.
For example, the JavaScript string "m\u0300" represents a single abstract
character (m with grave accent). In a plain textarea, this behaves like a
single character when cursoring, but like two characters when backspacing (so
the first backspace removes only the accent). In the VisualEditor, however,
cursoring takes two key presses, and after cursoring once, any typed text goes
between the letter and the accent, leaving a dangling combining accent.
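
Both cases can be reproduced outside the editor. The following is a small
sketch in plain JavaScript; the variable names are illustrative only and are
not part of the VisualEditor code:

```javascript
// U+282E2 is one character but two UTF-16 code units.
var elevator = '\uD860\uDEE2';
// 'm' followed by U+0300 COMBINING GRAVE ACCENT is one abstract character.
var accented = 'm\u0300';

console.log( elevator.length ); // 2
console.log( accented.length ); // 2

// Splitting on the empty string yields one element per code unit.
var units = ( elevator + accented ).split( '' );
console.log( units.length ); // 4

// Inserting text at offset 1 separates the two halves of the surrogate pair.
var broken = elevator.slice( 0, 1 ) + 'x' + elevator.slice( 1 );

// A high surrogate that is not followed by a low surrogate is ill-formed.
var loneHighSurrogate = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
console.log( loneHighSurrogate.test( elevator ) ); // false
console.log( loneHighSurrogate.test( broken ) ); // true
```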
These issues occur because the DataModel uses Arrays whose elements are
individual code units, say ['\uD860', '\uDEE2', ..., 'm', '\u0300']. My hunch
is that this is slightly too low level, and that it should instead use abstract
character elements, say ['\uD860\uDEE2', ..., 'm\u0300'], where each element
represents a whole character.
A good start would be to abstract away calls to string.split( '' ) into a
single function like this:
ve.splitCharacters = function ( value ) {
    // Split at every position that is not followed by a low surrogate,
    // so that surrogate pairs are never split.
    return value.split( /(?![\uDC00-\uDFFF])/ );
};
The rest of the codebase should call this function to perform splits, and then
not assume that data[i] is a single character. Then we can refine
splitCharacters as needed.
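
As an illustration of what one such refinement might look like, here is a
sketch that also keeps combining accents attached to their base letter. The
name splitCharactersSketch is hypothetical, and the range U+0300 to U+036F
covers only the Combining Diacritical Marks block; it is an assumption made
for brevity, not a complete definition of a character boundary (Unicode
Standard Annex #29 describes the full rules):

```javascript
// Hypothetical refinement of the split function from the report.
// Do not split before a low surrogate (keeps surrogate pairs whole) and
// do not split before a mark in U+0300..U+036F (keeps simple accents
// attached to the preceding letter). Other combining marks are NOT handled.
function splitCharactersSketch( value ) {
	return value.split( /(?![\uDC00-\uDFFF\u0300-\u036F])/ );
}

var parts = splitCharactersSketch( 'a\uD860\uDEE2m\u0300b' );
console.log( parts.length ); // 4
console.log( parts[ 1 ] === '\uD860\uDEE2' ); // true
console.log( parts[ 2 ] === 'm\u0300' ); // true
```

Callers would then treat each array element as one character of arbitrary
length rather than assuming that data[i].length is 1.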
Alternatively, since the overwhelming majority of characters will in fact be
single code units, perhaps the DataModel structure could "encode" the
exceptional multi-code-unit characters as objects, so that
'typeof data[i] === "string"' can still detect the simple cases.
This sounds like a big change for a small issue, but I think it would avoid
problems in the future. With a character representation, you can safely perform
useful operations like splicing and truncating without having to check the
surrounding context very carefully every time.
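
To make the difference concrete, here is a rough sketch of truncation under
both approaches. The helpers encodeCharacters and decodeCharacters, and the
{ text: ... } object shape, are hypothetical names chosen for this example and
are not existing VisualEditor APIs; the split regex is the one proposed above:

```javascript
// Split without separating surrogate pairs (regex from this report).
function splitCharacters( value ) {
	return value.split( /(?![\uDC00-\uDFFF])/ );
}

// Hypothetical encoding: single code units stay plain strings, while
// multi-code-unit characters are wrapped in an object.
function encodeCharacters( value ) {
	return splitCharacters( value ).map( function ( ch ) {
		return ch.length === 1 ? ch : { text: ch };
	} );
}

// Hypothetical inverse: turn the encoded array back into a string.
function decodeCharacters( data ) {
	return data.map( function ( item ) {
		return typeof item === 'string' ? item : item.text;
	} ).join( '' );
}

var text = 'a\uD860\uDEE2b';

// Truncating the raw string to two code units cuts the pair in half.
var naive = text.slice( 0, 2 );
console.log( /[\uD800-\uDBFF]$/.test( naive ) ); // true: ends in a lone surrogate

// Truncating the character array to two elements keeps the pair whole.
var data = encodeCharacters( text );
var safe = decodeCharacters( data.slice( 0, 2 ) );
console.log( safe === 'a\uD860\uDEE2' ); // true
console.log( typeof data[ 0 ] === 'string' ); // true: the simple case is still detectable
```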