Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
On 7 May 2015 at 02:07, Ross Moore wrote:
> Hi David,
>
> ..
>
> No disagreement to this.

OK:-)

>> In the current versions ^^^^d835^^^^dc00 is two characters in luatex
>> and one character in xetex
>> as the implementation detail that xetex's underlying storage is mostly
>> UTF-16 is exposed.
>
> This seems to be premature of XeTeX then.
> It seems to be making an assumption on how those bytes
> will ultimately be used.

I don't think it's so much assuming that; choosing to use UTF-16 as an
internal string format just tends to lead that way. Unlike UTF-8, UTF-16
cannot represent all code points in the 0-"10FFFF range.

If I switch to java(script) notation, which does define numeric
references as UTF-16 units rather than Unicode code points: if you do not
make it an error, you can encode an isolated surrogate such as "\ud835",
but there is no way to store the two-character sequence U+D835 U+DC00,
because "\ud835\udc00" is the single character U+1D400. So you can only
store such a character sequence if you store each text block as a
sequence of separate strings keeping unpaired surrogates apart
("\ud835", "\udc00"), which is a lot of effort for supporting input that
should never appear.

>> If it is
>> not possible to prevent ^^^ or utf8 encoded surrogate pairs combining
>> then it is better to
>> prevent them being formed.
>
> Hmm.
> What if you have an entirely different purpose in mind for those bytes?
> You still need to be able to create them and do further processing with
> them.

luatex has a different mechanism for this: it allows utf8 encoding and
^^^ numeric references to access the first 256 slots _above_ "10FFFF.
Quoting the luatex manual:

> Output in byte-sized chunks can be achieved by using characters just
> outside of the valid Unicode range,
> starting at the value 1 114 112 (0x110000). When the time comes to print a
> character c >= 1 114 112,
> LuaTeX will actually print the single byte corresponding to c minus
> 1,114,112.

This allows explicit byte-level access to file writing (so you can write
binary data such as images) without having to second-guess and invert the
character encoding the system uses to write characters to a file.

David

--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex
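As an illustration of the two points in this message (this is not XeTeX or LuaTeX code, and the helper names are made up for the example), here is a minimal Python sketch. The first helper stores the two code points U+D835 U+DC00 as UTF-16 code units and reads them back, which is where they merge into one character. The second mirrors the arithmetic in the manual quotation, assuming that exactly the 256 slots starting at 1 114 112 are treated as byte slots.

```python
def through_utf16(text: str) -> str:
    """Store text as UTF-16 code units, then read it back."""
    # "surrogatepass" lets Python encode lone surrogates instead of raising.
    units = text.encode("utf-16-be", "surrogatepass")
    return units.decode("utf-16-be", "surrogatepass")


def luatex_output_byte(c: int) -> int:
    """Byte value for a character code in the 256 slots above 0x10FFFF."""
    first_slot = 0x110000  # 1 114 112, the first value outside Unicode
    if not first_slot <= c < first_slot + 256:
        raise ValueError("not one of the byte-output slots")
    return c - first_slot


two_surrogates = chr(0xD835) + chr(0xDC00)  # two separate code points
stored = through_utf16(two_surrogates)

# Two code points go in, one comes out: the pair has become U+1D400.
print(len(two_surrogates), len(stored), hex(ord(stored)))

# A slot 0x89 above the Unicode range maps to the single byte 0x89.
print(luatex_output_byte(0x110000 + 0x89))
```

Once the pair has passed through UTF-16 storage, the information that it was entered as two separate characters is gone, which is the difficulty described above.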
Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
Hi David,

On 07/05/2015, at 9:26 AM, David Carlisle wrote:
>> The character itself, as bytes that is, is not wrong and users should be
>> able to create these.
>> But preferably through macros that ensure that they come correctly paired.
>
> placing two character tokens representing a surrogate pair should not
> though magically turn itself
> into a single character.

Agreed.
You don't know whether you want a single character until you know what
kind of output is being generated. That need not be known on input.

> The UTF-8 or ^^^^ encoding should refer to
> the unicode code point not
> to the UTF-16 encoding,

No disagreement to this.

> In the current versions ^^^^d835^^^^dc00 is two characters in luatex
> and one character in xetex
> as the implementation detail that xetex's underlying storage is mostly
> UTF-16 is exposed.

This seems to be premature of XeTeX then.
It seems to be making an assumption on how those bytes will ultimately
be used.

> If it is
> not possible to prevent ^^^ or utf8 encoded surrogate pairs combining
> then it is better to
> prevent them being formed.

Hmm.
What if you have an entirely different purpose in mind for those bytes?
You still need to be able to create them and do further processing with
them.

Maybe there should be a primitive that sets a flag controlling what
happens to surrogates' bytes on input?

It may well be that XeTeX's current behaviour is best for putting content
into PDF pages; but not best in other situations. So a macro programmer
should have a means to change this, when needed.

> this is no different to XML where &#xd835;&#xdc00; always refers to
> two (invalid) characters not
> to &#x1d400;

Seems fine to me.
If application software wants/needs to combine them, it can do so.

> David

Cheers,

Ross

Ross Moore
Senior Lecturer
Mathematics Department | Level 2, E7A
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114
M: +61 407 288 255 | http://www.maths.mq.edu.au/
CRICOS Provider Number 2J.
Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
> The character itself, as bytes that is, is not wrong and users should be able
> to create these.
> But preferably through macros that ensure that they come correctly paired.

Placing two character tokens representing a surrogate pair should not,
though, magically turn them into a single character. The UTF-8 or ^^^^
encoding should refer to the Unicode code point, not to the UTF-16
encoding.

In the current versions ^^^^d835^^^^dc00 is two characters in luatex and
one character in xetex, as the implementation detail that xetex's
underlying storage is mostly UTF-16 is exposed. If it is not possible to
prevent ^^^ or utf8 encoded surrogate pairs combining, then it is better
to prevent them being formed.

This is no different to XML, where &#xd835;&#xdc00; always refers to two
(invalid) characters, not to &#x1d400;.

David
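The XML comparison can be checked with a short Python sketch. It assumes the expat-based parser that ships with CPython's standard library, and `parses` is a hypothetical helper written for this example. XML 1.0 excludes the surrogate range from its legal characters, so this parser rejects the two references outright and never combines them.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A reference to the real code point is accepted.
print(parses("<m>&#x1d400;</m>"))

# Two surrogate references are not silently merged into U+1D400;
# the parser reports an invalid character reference instead.
print(parses("<m>&#xd835;&#xdc00;</m>"))
```

Other XML tools may report the problem differently, but none should treat the pair as a spelling of U+1D400.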
Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
Hi Arthur,

On 07/05/2015, at 8:04, Arthur Reutenauer wrote:
> While working on these bugs, we also discussed how surrogate
> characters were handled in XeTeX. Surrogate characters are the 2048
> code points that are used in UTF-16 to encode characters with code
> points above 65535: a pair of them makes up one Unicode character;
> however they're not meant to be used in isolation, even though they have
> code points like other characters (they're not just byte sequences).
>
> Right now, XeTeX allows isolated surrogate characters, and also
> combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
> We want to flag the former case but are not sure how: should we make the
> characters invalid (with catcode 15)?

That would definitely be wrong.

The character itself, as bytes that is, is not wrong and users should be
able to create these.
But preferably through macros that ensure that they come correctly paired.

IMHO, this is a macro issue, not an engine issue.

The same kind of thing applies with combining accents and diacritics.
I've written macros that take an argument and follow it with a combining
character. This is useful for generating correct UTF8 bytes to put into
XML packets, as needed for the XMP Metadata that is required in PDF files
that must validate for ISO specifications.

Similar macros could be used to construct upper-plane characters from
surrogates, given only the math style and Latin letter.
For these, single surrogate characters will be needed in the macro
definitions, with the ultimate matching pair to be determined
algorithmically, probably using an \ifcase instance.

Single characters thus need to be able to be input, so as to create the
macro definition.

OK, a clever macro programmer can change the catcodes to become valid
local to the macro definition. But that is really complicating things.

> Or we could map them to the
> standard "unknown" character (U+FFFD). The latter case is more nasty
> and should definitely be forbidden -- the ^^ notation should only be
> used for "proper" characters (so instead of the above, the Unicode code
> point of the resulting Unicode character should be used, in this case
> ^^^^^1d400).

I disagree.
The ^^ notation can be used in macros to create the required bytes, for
writing out into a file other than the .dvi or .pdf output.
pdfTeX (or other engine) then can cause that file to become embedded as a
file object stream in the final PDF.

> Any thoughts?
>
> Best,
>
> Arthur

Hope this helps,

Ross
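The algorithmic selection of a matching pair described above can be sketched in Python. This is not a TeX macro, only an illustration of the arithmetic such a macro would have to reproduce. The style table and function names are assumptions made for the example, and only two styles whose capital letters are contiguous in Unicode are covered.

```python
# Assumed style table: first code point of the capital letters A-Z.
STYLE_BASE = {
    "bold": 0x1D400,         # MATHEMATICAL BOLD CAPITAL A
    "bold-italic": 0x1D468,  # MATHEMATICAL BOLD ITALIC CAPITAL A
}


def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def math_letter_pair(style: str, letter: str) -> tuple:
    """Surrogate pair for a capital Latin letter in the given math style."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single capital letter A-Z")
    return surrogate_pair(STYLE_BASE[style] + ord(letter) - ord("A"))


high, low = math_letter_pair("bold", "A")
print(f"{high:04x} {low:04x}")  # d835 dc00, the pair discussed in this thread
```

For these two styles the high surrogate is always D835 and only the low surrogate varies with the letter, which is what makes a lookup keyed on style and letter practical.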
Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
On 6 May 2015 at 23:04, Arthur Reutenauer wrote:
> While working on these bugs, we also discussed how surrogate
> characters were handled in XeTeX. Surrogate characters are the 2048
> code points that are used in UTF-16 to encode characters with code
> points above 65535: a pair of them makes up one Unicode character;
> however they're not meant to be used in isolation, even though they have
> code points like other characters (they're not just byte sequences).
>
> Right now, XeTeX allows isolated surrogate characters, and also
> combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
> We want to flag the former case but are not sure how: should we make the
> characters invalid (with catcode 15)? Or we could map them to the
> standard "unknown" character (U+FFFD). The latter case is more nasty
> and should definitely be forbidden -- the ^^ notation should only be
> used for "proper" characters (so instead of the above, the Unicode code
> point of the resulting Unicode character should be used, in this case
> ^^^^^1d400).
>
> Any thoughts?

A major difference between using catcode 15 and the engine's input filter
substituting U+FFFD is that the former could be overridden at the macro
layer. Whether that's a good thing or not depends a bit on what happens
if a document puts the catcodes back to (say) 12.

If you just get undefined characters and missing glyphs, then you get
what you ask for and probably it should be allowed just because. If the
internals can't reliably deal with an unpaired surrogate (e.g. it crashes
some font library API), then the engine had better ensure it doesn't
easily happen, and U+FFFD is probably as good as anything.

If you do go for catcode 15, then (as suggested in the thread on
unicode-letters.def) it could be set in the macro layer or the engine
could initialise these catcodes. Doing it at the macro layer is probably
more in the spirit of the traditional catcode initialisation, which is
very minimalist.

As you say, combining ^^^^d835^^^^dc00 into one token is just wrong, and I
think the engine should do (twice) whatever you decide to do for unpaired
surrogates.

David
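The U+FFFD option can be pictured with a small Python sketch of an input filter. It does not reflect XeTeX's actual input code, and the function names are invented for the illustration. Each surrogate is replaced on its own, so a would-be pair becomes two replacement characters, matching the "(twice)" suggestion above.

```python
REPLACEMENT = "\ufffd"  # the standard "unknown" character


def is_surrogate(ch: str) -> bool:
    """True for the 2048 code points U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def filter_input(text: str) -> str:
    """Replace every surrogate code point, paired or not, with U+FFFD."""
    return "".join(REPLACEMENT if is_surrogate(ch) else ch for ch in text)


line = "a" + chr(0xD835) + chr(0xDC00) + "b"
filtered = filter_input(line)
print([hex(ord(ch)) for ch in filtered])
```

A filter of this kind cannot be undone at the macro layer, which is the difference from the catcode 15 approach discussed in the message.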
Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
While working on these bugs, we also discussed how surrogate characters
were handled in XeTeX. Surrogate characters are the 2048 code points
that are used in UTF-16 to encode characters with code points above
65535: a pair of them makes up one Unicode character; however they're not
meant to be used in isolation, even though they have code points like
other characters (they're not just byte sequences).

Right now, XeTeX allows isolated surrogate characters, and also combines
sequences such as ^^^^d835^^^^dc00 into one Unicode character. We want to
flag the former case but are not sure how: should we make the characters
invalid (with catcode 15)? Or we could map them to the standard "unknown"
character (U+FFFD). The latter case is more nasty and should definitely
be forbidden -- the ^^ notation should only be used for "proper"
characters (so instead of the above, the Unicode code point of the
resulting Unicode character should be used, in this case ^^^^^1d400).

Any thoughts?

Best,

Arthur
Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
On 4 May 2015 at 16:27, Jonathan Kew wrote:
> ...
>
> A fix for this bug, so that \string generates single Unicode characters
> even for values above U+FFFF, is currently on the utf16-issues branch in
> the XeTeX repository on sourceforge.[1]
>
> A bug with characters above U+FFFF within \scantokens[2] is also fixed on
> this branch.
>
> There are also a couple of new primitives available in this branch:
>
> (1) \Uchar <number>
>
> where <number> is a number in the range 0.."10FFFF
>
> is an expandable command that produces a character token with the given
> Unicode value, and catcode=12 (other character). This is different from
> TeX's \char primitive from a macro-programming point of view, in that it
> expands to a character token rather than being a typesetting command.
>
> (I believe this is similar to the \Uchar command available in luatex.)
>
> (2) \Ucharcat <number> <catcode>
>
> where <number> is a number in the range 0.."10FFFF
> and <catcode> is a number in the ranges 1..4, 6..8, 10..12
>
> is an expandable command that produces a character token with Unicode
> value <number> and catcode <catcode>. This allows macro programmers to
> create character tokens with various catcode assignments much more easily
> than is otherwise possible.
>
> Feedback and testing is invited; but note that currently this will require
> pulling the code from sourceforge and building the new xetex, as binary
> packages are not available.
>
> If testing in the next day or two doesn't uncover any alarming problems,
> these fixes/features will be merged to the master branch and to TeXLive, in
> preparation for the TL2015 release.
>
> JK

Thanks for this!

I've built the version from this branch and it does appear to address all
the test cases I had for characters above "FFFF, and \Uchar(cat) will be
incredibly useful in defining expandable operations on token lists, and
for code that should be compatible with both luatex and xetex.

David
[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
On 23/4/15 20:59, David Carlisle wrote:
> I can confirm that \string does convert character tokens to two tokens
> giving the UTF-16 representation.
>
> With the attached file luatex produces
>
> 90,33
> 34,33
> 233,33
> 233,33
> 65530,33
> 65537,33
> 65537,33
>
> which is in each case the unicode value of the character followed by
> that of !
>
> xetex produces
>
> 90,33
> 34,33
> 233,33
> 233,33
> 65530,33
> 55296,56321
> 55296,56321
>
> where the last two lines show that \string has generated U+D800 U+DC01
> which does correspond to the UTF-16 encoding of U+10001
> confirming that \string on a character token has produced two tokens
> that have been picked up separately as #1 and #2 of the \test macro.

A fix for this bug, so that \string generates single Unicode characters
even for values above U+FFFF, is currently on the utf16-issues branch in
the XeTeX repository on sourceforge.[1]

A bug with characters above U+FFFF within \scantokens[2] is also fixed on
this branch.

There are also a couple of new primitives available in this branch:

(1) \Uchar <number>

where <number> is a number in the range 0.."10FFFF

is an expandable command that produces a character token with the given
Unicode value, and catcode=12 (other character). This is different from
TeX's \char primitive from a macro-programming point of view, in that it
expands to a character token rather than being a typesetting command.

(I believe this is similar to the \Uchar command available in luatex.)

(2) \Ucharcat <number> <catcode>

where <number> is a number in the range 0.."10FFFF
and <catcode> is a number in the ranges 1..4, 6..8, 10..12

is an expandable command that produces a character token with Unicode
value <number> and catcode <catcode>. This allows macro programmers to
create character tokens with various catcode assignments much more easily
than is otherwise possible.

Feedback and testing is invited; but note that currently this will
require pulling the code from sourceforge and building the new xetex, as
binary packages are not available.

If testing in the next day or two doesn't uncover any alarming problems,
these fixes/features will be merged to the master branch and to TeXLive,
in preparation for the TL2015 release.

JK

[1] https://sourceforge.net/p/xetex/code/ci/utf16-issues/tree/
[2] https://sourceforge.net/p/xetex/bugs/80/
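The numbers in the quoted test output can be reproduced outside TeX. The following Python sketch (with `utf16_units` as a made-up helper name) lists the UTF-16 code units for each character value in the test. It only illustrates where 55296 and 56321 come from, and says nothing about how XeTeX implements \string.

```python
import struct


def utf16_units(code_point: int) -> list:
    """UTF-16 code units (as integers) for one Unicode code point."""
    data = chr(code_point).encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


# Character values from the test output in the quoted message.
for cp in (90, 34, 233, 65530, 65537):
    print(cp, utf16_units(cp))
```

Every value up to 65535 yields a single unit equal to itself, while 65537 (U+10001) yields the two units 55296 and 56321 (U+D800, U+DC01), the pair that the unfixed \string exposed as two tokens.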