[ Please don't copy me on replies; the place for this is the mailing list, not my inbox, unless you want to go off-list. ]
On 2012-07-11, Hans Aberg <[email protected]> wrote:
>Unicode has added all the characters from TeX plus some, making it
>possible to use characters in the input file where TeX is forced to
>use ASCII. This though changes the paradigm, and it is a question of
>which paradigm one wants to adhere to.

This doesn't seem to make much sense, or have much truth, to me.

TeX does not have a notion of character in the Unicode sense. TeX is a (meta-)programming language for putting ink on paper. It ultimately produces instructions of the form "print glyph 42 from font cmr10 at this position". It does not know or care whether the glyph happens to be a representation of some Unicode character. (It also isn't tied to ASCII for its input - when I first used TeX, it was on an EBCDIC system.)

There are many characters that TeX users use that are not in Unicode. Indeed, you can't even correctly represent the name of the system in Unicode, or any other plain text system - an entirely deliberate choice by Knuth to emphasise that TeX is a typesetting program, not a text representation format.

Because TeX is agnostic about such matters, one can set up any convenient encoding for the input data (which is really the source code of a program). For example, I have written documents in ASCII, Latin-1, Big5, GB, UTF-8 and probably others. This is very convenient; but it's only a convenience.

If one uses UTF-8, then one has the problem of how to deal with the case where Unicode trespasses on TeX's territory, by specifying font styles. This is not hard: for example, the obvious thing to do is to arrange for the Unicode MATHEMATICAL ITALIC SMALL M to be an abbreviation for \mathit{m}, and so on.

Note, incidentally, that this is not the same as the meaning of a plain ASCII (or EBCDIC) "m" in TeX. In TeX math mode, the meaning of "m" is dependent on the currently selected math font family: just as in plain text, the font of "m" depends on the currently selected text font.
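
A minimal sketch of that mapping, assuming pdfLaTeX with UTF-8 input handled by the inputenc package. The choice of U+1D45A (MATHEMATICAL ITALIC SMALL M) as the code point and of \mathit as the target are illustrative assumptions; the second formula is only there to show how the plain "m" follows the selected math alphabet while the mapped character does not.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}

% Assumption for illustration: treat U+1D45A as shorthand for \mathit{m}.
% Each further Unicode "styled" letter would need its own declaration.
\DeclareUnicodeCharacter{1D45A}{\mathit{m}}

\begin{document}

% The plain "m" takes its font from the current math family;
% the Unicode character always expands to \mathit{m}.
$m + 𝑚$

% Inside \mathsf the plain "m" becomes sans-serif,
% while the mapped character still asks for \mathit.
$\mathsf{m + 𝑚}$

\end{document}
```

Whether the second "m" in the sans-serif formula ought to stay serif italic is exactly the kind of decision that the mapping has to make on the document's behalf.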

One problem, of course, is that there is no MATHEMATICAL ROMAN set of characters. This is one of the biggest botches in the whole mathematical alphanumerical symbol botch. If you encode semantic font distinctions without requiring the use of higher-level markup, then you need to encode also letters that are semantically distinctively roman upright. The square root of -1 cannot be italicized in the statement of a theorem, unlike all the "i"s that appear in the text of the theorem. Yet Unicode provides no way to mark this semantic distinction between the characters, and has to rely on the higher-level markup distinguishing maths (to which some font style changes should not be applied) from text (in which they should).

A more general problem is that which font styles are meaningful depends on the document. For example, I give lectures and talks, and I set my slides in sans-serif. As I don't (usually) use distinctive sans-serif symbols in my work, the maths is all in sans-serif too: form, not content. But what then should I see if I type a Unicode mathematical italic symbol in my slides? Serif, or sans-serif?

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

