I'm really not a technical expert, but what you write sounds to me as if JavaScript's UCS-2 implementation were broken... Thanks for the linked document.
Sz

On Wed, Apr 25, 2012 at 11:41, Marc Durdin <[email protected]> wrote:

> Yes, but this means that regexes with SMP don’t work (e.g. [𝒜-𝒵]),
> character counts returns code units, etc. So you have to reimplement
> string.length, string.charCodeAt, etc, if you don’t want to deal with
> surrogate pairs (I reckon you’ve got better things to be spending your
> time on).
>
> http://dheeb.files.wordpress.com/2011/07/gbu.pdf “Unicode Support
> Shootout - The Good, the Bad & the (mostly) Ugly” by Tom Christiansen
> has a great summary of some of the issues with relying on JavaScript’s
> internal string manipulation (unfortunately can’t find a better working
> link at present – the official training.perl.com site seems to be
> down). Actually, that presentation is a fantastic place to start for
> understanding many of the limitations of various programming languages’
> support for Unicode – if you haven’t read it, I’d urge you to go read
> it now.
>
> Marc
>
> From: Szelp, A. Sz. [mailto:[email protected]]
> Sent: Wednesday, 25 April 2012 7:28 PM
> To: Marc Durdin
> Cc: David Starner; Unicode Mailing List
> Subject: Re: Support for non-BMP characters
>
> Shouldn't it be technically possible to store Supplementary Plane
> characters in UTF-16 / UCS-2 as well? Isn't that what Surrogate Pairs
> are for?
>
> Sz
>
> On Wed, Apr 25, 2012 at 11:09, Marc Durdin <[email protected]> wrote:
>
> Probably the most egregious example I know of is JavaScript. As far as
> I know, JavaScript still only groks UCS-2. I'd love to be wrong.
>
> Marc
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> David Starner
> Sent: Wednesday, 25 April 2012 6:32 PM
> To: Unicode Mailing List
> Subject: Support for non-BMP characters
>
> It's been ten years since the first non-BMP characters were encoded.
> How are they working in your neck of the woods? There's a lot of places
> where they're working just fine, but I was facing MySQL's support. It
> has had support for UCS-2 and UTF-8 limited to the BMP for a long time;
> now in MySQL 5.5 there's utf16, utf32 and utf8mb4. (MySQL 5.1 and 5.5
> are the current stable releases.) But there's enough warnings about
> incompatibilities with utf8mb4 to make me pause before switching my
> private database to it, and I think the net will see MySQL databases
> with utf8 instead of utf8mb4 as long as MySQL exists, unless they
> decide to push people over to it.
>
> (Ada's an issue too, though not one most people will have to deal with.
> While Ada 2005 added a UTF-32 string type, it left the UCS-2 string
> type as is. Again, I suspect a lot of nominally Unicode Ada programs
> are going to BMP-only. Of course, UTF-8 as an ASCII superset is used,
> stuffed into strings labeled Latin-1; it's technically not conformant
> with the Ada standard but it works so long as you don't need much
> string processing.)
>
> In any case, is the use of non-BMP characters still problematic in your
> corner of the computing world or is everything looking fine from where
> you are?
>
> --
> Kie ekzistas vivo, ekzistas espero.
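
For anyone following along, the code-unit versus code-point distinction in this thread can be sketched roughly as below. This is only an illustration in TypeScript, not anything taken from the linked presentation: the helper names toSurrogatePair and countCodePoints are made up for the example, and it relies on nothing beyond string.length and string.charCodeAt, the two members mentioned above.

```typescript
// Hypothetical helper: split a supplementary-plane code point
// (U+10000..U+10FFFF) into its UTF-16 high and low surrogate.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary-plane code point");
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: count code points rather than UTF-16 code units
// by treating a well-formed surrogate pair as a single character.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low surrogate that completes the pair
      }
    }
    count++;
  }
  return count;
}

// U+1D49C MATHEMATICAL SCRIPT CAPITAL A, written as its surrogate pair.
const scriptA = "\ud835\udc9c";

console.log(scriptA.length);           // 2 (code units)
console.log(countCodePoints(scriptA)); // 1 (code point)
```

So the pair can be stored without trouble; the catch is that the built-in length and indexing still see two units where a reader sees one character.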

