* Martin J. Dürst wrote:
>I'm hoping to get some advice from people with experience with various
>Unicode/transcoding libraries.
>
>RFC 3987 (the current IRI spec) has the following text:
>
>   Note: Some older software transcoding to UTF-8 may produce illegal
>   output for some input, in particular for characters outside the
>   BMP (Basic Multilingual Plane).  As an example, for the IRI with
>   non-BMP characters (in XML Notation):
>   "http://example.com/&#x10300;&#x10301;&#x10302;";
>   which contains the first three letters of the Old Italic alphabet,
>   the correct conversion to a URI is
>   "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"
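
For reference, here is a minimal Python sketch of the conversion the quoted note describes, next to the kind of illegal output an encoder might produce if it handled each UTF-16 surrogate separately. Both function names are made up for illustration, and the second function is only my assumption about how such a bug could look, not a reproduction of any particular library. The sketch covers only the percent-encoding of non-ASCII characters, not the full mapping procedure in RFC 3987.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    Hypothetical helper: a simplified sketch, not the complete
    IRI-to-URI mapping (no IDN handling, no normalization).
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII is passed through unchanged
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def percent_encode_via_surrogates(iri: str) -> str:
    """Illustrate the ILLEGAL output: each UTF-16 surrogate is encoded
    on its own as a three-octet sequence (assumed failure mode)."""
    out = []
    for ch in iri:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp > 0xFFFF:
            # Split the code point into a UTF-16 surrogate pair.
            v = cp - 0x10000
            for unit in (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)):
                # "surrogatepass" lets Python emit the invalid bytes.
                raw = chr(unit).encode("utf-8", "surrogatepass")
                out.extend(f"%{b:02X}" for b in raw)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


# U+10300..U+10302: the first three letters of the Old Italic alphabet.
iri = "http://example.com/\U00010300\U00010301\U00010302"
print(percent_encode_non_ascii(iri))
print(percent_encode_via_surrogates(iri))
```

The first call prints the URI given in the note. The second prints six three-octet sequences starting with %ED%A0%80%ED%BC%80, which is not valid UTF-8.
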
MySQL versions whose native UTF-8 type does not support non-BMP
characters are probably still around, as I recall the history of that,
and I've recently encountered plenty of encoders that do not support
non-BMP characters either (JavaScript SHA-1 implementations come to
mind). But a general-purpose encoder that people might reasonably use
with IRI software (perhaps because there is no simple alternative)
seems a bit of a stretch.

I do think it would nevertheless be useful to have such an example in
the specification, as people tend to test their code against examples
in the specification if no other test suite is immediately available.
It should, however, be in some "examples" or "test cases" section, akin
to section 5.4.1 of RFC 3986, without the commentary in the note you've
quoted.
-- 
Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

