2011/8/21 Doug Ewell <[email protected]>:
> For once, I am in strong agreement with something Philippe had to say:
>
>> We really need a reliable way to transport a PUA agreement in such a
>> way that it can be understood by a computer.
>
> I don't necessarily agree that fonts, or (especially) any particular font
> technology, are the one and only way to accomplish this, because there's more
> to character handling than display. Maybe some sort of open format could be
> devised that could be used as a plug-in to a variety of existing components.

Yes, but without display support, at least, none of the other needs will ever be addressed, because there will be no encoded text to work with. Don't even dream of performing plain-text search, for example, if you have no encoded texts to search in! Collation is then a secondary target. Proper display is the immediate need: it comes even before the development of easy input methods, and before later developments such as spell checkers, content indexers, semantic analyzers, and the localization of software to use a given script in its UI.

For proper display of PUAs, all that is needed is a minimum set of character properties. I have argued, against what Peter Constable thinks, that OpenType cannot handle RTL characters assigned to PUAs, because it has absolutely no source of information to know whether a run of text is RTL or LTR when implementing the BiDi algorithm.

OK, the mirroring property is probably not essential, because most mirrored characters today are punctuation marks, and the encoded ones already cover a very wide range. If needed, additional PUA punctuation marks may be added, and even coded at two mirrored code positions, even if they are not automatically mirrored according to their context: for such rare cases, using BiDi format controls around them, or equivalent CSS embedding styles in HTML and similar techniques, will be enough.

But for most RTL text using PUAs in long runs, or mixed within other sequences of standard RTL characters (for example in the middle of words), format controls are clearly not the solution: they do not work reliably in HTML, for example, if you have to split words across separate spans, and inserting those controls in the middle of words is really a nightmare. In addition, this completely defeats the plain-text searchability and editability of encoded texts. It will slow down the production of encoded texts so much that, in fact, almost no work will be done with those PUAs. As a consequence, most texts will wait indefinitely for some encoding effort.

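To make the point about the missing directionality information concrete, here is a minimal sketch in Python. It assumes a hypothetical private agreement that declares U+E000..U+E07F to be right-to-left; the table name PUA_BIDI_OVERRIDES, the chosen range, and both function names are mine, for illustration only. The direction test is a simplified form of rules P2 and P3 of UAX #9 (the first strong character decides) and ignores isolates and embeddings. In the standard data, private-use code points default to Bidi_Class L, so without the override a run of such characters is treated as left-to-right.

```python
import unicodedata

# Hypothetical private agreement: (first, last, Bidi_Class) for PUA ranges.
# The range below is an arbitrary choice for illustration.
PUA_BIDI_OVERRIDES = [
    (0xE000, 0xE07F, "R"),
]

def bidi_class(ch, overrides=PUA_BIDI_OVERRIDES):
    """Return the Bidi_Class of ch, preferring the private agreement."""
    cp = ord(ch)
    for first, last, value in overrides:
        if first <= cp <= last:
            return value
    # Fall back to the properties shipped with the Python runtime.
    return unicodedata.bidirectional(ch)

def base_direction(text, overrides=PUA_BIDI_OVERRIDES):
    """Simplified UAX #9 rules P2/P3: the first strong character decides."""
    for ch in text:
        cls = bidi_class(ch, overrides)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"  # no strong character found: default to left-to-right

if __name__ == "__main__":
    run = "\ue000\ue001\ue002"
    print(base_direction(run))                # with the private agreement
    print(base_direction(run, overrides=[]))  # with standard data only
```

The only thing the layout code needs from the agreement here is one property value per range, which is why I say a minimum set of properties is enough for display.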
The need will become even more urgent now that the UTC and WG2 will spend most of their time discussing scripts that are rarely used, for which the cultural knowledge will be difficult to find. If we don't have an easy way to experiment with their encodings, at least with PUAs, for extended periods (because a long research period will be needed, with conflicting experiments), those scripts will remain unencoded in the UCS for a very long time. And in fact I doubt that even WG2 or the UTC will have the resources to provide all this effort without committing many critical errors that will plague the long-term future.

We absolutely need a transition mechanism, and PUAs can be part of this transition. For the same reason, the possibility of supporting external character properties, for characters that are not yet encoded or that are encoded in separate efforts via PUAs, and that will later be encoded with low levels of implementation and deployment for many years, would certainly help keep the needed resources (at UTC and WG2) at a low level: most of the experiments would be performed independently, without depending on the release of a putative version of the UCS that finally agrees to encode the script.

But even in this case, for historic scripts, the encoding effort will be hard to finalize: it is highly probable that those scripts will be encoded progressively, starting with a minimum subset on which most people agree, with many other characters remaining that need longer experimentation or research. Those scripts will then need to support a mix of standard assignments and PUAs at the same time, for a long period, for distinct small communities that will need to share and discuss their agreements.

The current problem is that there is absolutely no transition mechanism in the UCS encoding process: a character gets fully encoded with most of its essential properties becoming normative, some of them impossible to change later (even if there was an error or an unexpected caveat that the interested communities had no chance to experiment with before the properties were finally approved by the UTC and WG2).

Unicode should not interfere with what users want to do with PUAs. After all, PUAs were made specifically for that. If users need to assign their own property values to PUAs, they must be able to do so. And these properties must find a way to be representable in the current technology frameworks.

If those frameworks refuse all changes (e.g. UTC/WG2 reject the assignment of new RTL PUAs, or UTC rejects the change of properties for some PUA ranges, or the OpenType promoters don't want to integrate custom character properties for PUAs assigned in fonts, or OpenType layout engine implementers refuse to include a way to use an external set of properties), there will be no other way than to create another technology that won't require any prior approval by existing non-collaborative standards bodies (or implementers) whose strong requirements cannot be satisfied, even with PUAs. This also means that there will be independent developments of non-compliant UCS implementations, which will later be hard to reconcile with the current standard framework. UCS promoters and designers must admit that they had to offer a transition mechanism in order to facilitate the transition and adoption.

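As an illustration of what a machine-readable set of user-assigned PUA properties could look like, here is a small sketch in Python. The file format is hypothetical: it is only loosely modeled on the semicolon-separated data files of the UCD, and the field order, the sample ranges, and the names PuaProps, parse_agreement and lookup are all assumptions of mine, not an existing standard or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PuaProps:
    general_category: str
    bidi_class: str
    mirrored: bool

# Hypothetical agreement file; the ranges and values are invented examples.
SAMPLE = """\
# first..last ; General_Category ; Bidi_Class ; Bidi_Mirrored
E000..E07F ; Lo ; R  ; N
E080..E08F ; Ps ; ON ; Y
F000       ; Nd ; L  ; N
"""

def parse_agreement(text):
    """Parse the hypothetical format into a list of (first, last, PuaProps)."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, gc, bidi, mirrored = [field.strip() for field in line.split(";")]
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo   # a single code point is a 1-item range
        ranges.append((lo, hi, PuaProps(gc, bidi, mirrored == "Y")))
    return ranges

def lookup(ranges, ch):
    """Return the agreed properties for ch, or None if the agreement is silent."""
    cp = ord(ch)
    for lo, hi, props in ranges:
        if lo <= cp <= hi:
            return props
    return None

if __name__ == "__main__":
    agreement = parse_agreement(SAMPLE)
    print(lookup(agreement, "\ue010"))
    print(lookup(agreement, "A"))
```

Something this small could travel with a font or a document, or be loaded separately by a layout engine, without requiring any change to the standard assignments.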
This was in fact what happened when the UTC and WG2 each started independent (and technically very different) efforts to create a universal character set: they also had to accept transition mechanisms with roundtrip compatibility with the prior encodings (both standard encodings at ISO and its NBs, and proprietary ones that had been opened for more general use and integrated by the UTC in its support). Transition mechanisms were also added later to coordinate the efforts of the UTC and WG2 into a common UCS. With all these mechanisms, we did not move suddenly from legacy encodings to the UCS.

In fact this does not apply only to the encoding work at the UTC and WG2: it still exists now in the OpenType format (multiple "cmap" tables, multiple glyph formats, multiple typographic feature formats). It means integrating optional features and designing a set of priorities that can enforce some common usage policy, in order to converge later on a more stable situation and a wider adoption.

