2012/7/9 Naena Guru <[email protected]>:
>> Using Latin letters for a transliteration of Sinhala is not a hack, but
>> making fonts said to be Latin-1 with Sinhalese letters instead of the Latin
>> letters is a hack.
Your hack is still a hack, simply because you have understood nothing about what Unicode is, and you keep confusing concepts. It is true that Unicode and ISO/IEC 10646 use a terminology that may not match the way you mean or use those words; that is why they include definitions of these terms. Do not interpret the terminology in a way different from what is defined.

> Well, you can characterize the smartfont solution anyway you like. The
> problem for you is that it works!

No, it does not work, because you assume that we can always select the font. In most cases we cannot. Letters are encoded and given unique code points, but the font used to render them is determined externally by the renderer. In most cases users will not want to guess which font to use, notably when those fonts are not even available on their platform. There is a huge life of text outside HTML and rich-text document formats, and you absolutely want to ignore it.

The UCS exists to allow exactly this separation between the presentation (fonts, for example) and the semantics of the encoded text. The UCS is also designed to avoid dependency on languages: only scripts are encoded (see the description of what is defined as an "abstract character").

An encoding is not just a collection of bits in fixed-width numbers; otherwise we would only see numbers on screen. The code points in the UCS are given semantics via character properties:

- The "representative glyph" shown in the code charts is only a very tiny part of these properties, and in fact the least used of all of them; it is only useful for producing visual charts.
- What is more important is how each distinct code point behaves within the various mappings that support various algorithms, including the possibility of switching fonts transparently without breaking the text completely (for example, displaying a Greek Theta where a Latin Z with acute was encoded, or even a Latin X where a Latin R was encoded).
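This separation of code point semantics from rendering can be checked in any Unicode-aware environment. Here is a minimal Python sketch (the sample characters are my own illustrative choices), using the standard unicodedata module:

```python
# Each code point carries standard properties (name, general category, ...)
# that hold no matter which font, if any, eventually renders the character.
import unicodedata

for ch in ("Z", "\u0398", "\u0D9A"):  # Latin Z, Greek Theta, Sinhala Ka
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

This reports the Latin and Greek letters as category Lu (uppercase letter) and the Sinhala letter as Lo (other letter): the semantics live in the character database, not in any font.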
The encoding is what allows words and orthographies to be recognized, still independently of font styles and other optional typographic effects (all scripts admit an almost infinite number of possible styles, which users still read as part of the script while also recognizing the orthography used and the language). Unicode and ISO/IEC 10646 do not encode glyphs directly in the UCS. They do not encode orthographies, and they do not encode languages. What is encoded is a set of correlated properties. One of these properties is a numeric property named "code point", which is itself independent of the final binary encoding (that could be one of the standard UTFs, or even a legacy 8-bit encoding with a mapping to/from the UCS)!

> Sorry for this Kindergarten lesson, but you should understand the role of
> the font. A font is a support application at the User Interface level.

Yes. But Unicode does not really care which font you will use, provided that fonts map glyphs coherently, in such a way that Sinhalese letters will not be rendered instead of the intended Latin letters EVEN if a Sinhalese font has been selected.

> When text moves between
> applications and between computers, they travel as numeric codes
> representing the text in the form of digital bytes. The computer can't say
> French from Singhala.

Not relevant to our discussions on this Unicode mailing list. We do not care about that, and SHOULD not even care. Unicode supports a wide range of possible binary encodings; they do not change the code point assignments, which are the central point from which all other properties are mapped in all applications, including (but not limited to) rendering.

>> Oh, thank you for the generosity of allowing me use of the entire Latin
>> repertoire. You don't have to tell that to me.
We need to tell it to you again, because you absolutely want to restrict the repertoire to an 8-bit subset while ALSO, contradictorily, saying that you want to support thousands of aksharas. Unicode supports over a million characters and tens of millions of glyphs (possibly more) using a 21-bit encoding space (actually less than 20 bits if we leave aside the PUA, which is also supported separately, but with an extremely free encoding and almost no standard properties). This space is representable with various encodings: some are part of the Unicode and ISO/IEC 10646 standards, some are supported in the references, and there are tons of others, including many legacy 7-bit or 8-bit SBCS encodings, from ISO, from proprietary platforms, or from national standards not part of ISO (e.g. those developed in China such as GB18030, or in India such as ISCII), plus many older standards that have since been deprecated and are no longer recommended.

But ISO 8859-1 is DEFINITELY NOT a mere set of 8-bit numeric values. You are confusing anonymous "bytes" with what a text encoding is. The ISO 8859-1 encoding has a strong and unambiguous mapping to/from the UCS, so that when it is used, ONLY the Latin letters will be displayed (and not every other possible one). It is strictly IMPOSSIBLE to encode any Sinhalese letter (or diacritic, or digit, or ligatured conjunct, or any more complex akshara, or simpler mora) using ISO 8859-1. Trying to do so clearly violates the ISO 8859-1 standard (and ultimately the UCS standards, due to the strong mapping from ISO 8859-1 to the UCS). Your attempt to do that is then clearly a hack (one which also violates the separation of the encoding from the rendering style, and limits you to ONLY some specific rich-text formats).
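The strength of that mapping is easy to demonstrate. A minimal Python sketch (the sample characters are mine) showing that Latin-1 text round-trips losslessly while any Sinhalese letter is rejected outright:

```python
# ISO 8859-1 maps byte values 0x00..0xFF one-to-one onto Unicode code
# points U+0000..U+00FF, so Latin-1 text survives a round trip unchanged.
assert "Fran\u00E7ais".encode("latin-1") == b"Fran\xe7ais"
assert b"Fran\xe7ais".decode("latin-1") == "Fran\u00E7ais"

# A Sinhalese letter (U+0D9A) simply has no image under that mapping:
try:
    "\u0D9A".encode("latin-1")
except UnicodeEncodeError as err:
    print("rejected, as the standard requires:", err.reason)
```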
It just complicates things everywhere, because nothing in your proposal allows the language to be properly identified, and authors will be restricted to rich-text formats that let them request very specific fonts that are not portable to all platforms. Your proposal also requires them to tag these specific fonts everywhere, and to switch fonts constantly within documents. Creating composite documents that mix content from various sources will be almost impossible, and authors will be severely limited in their design choices. Users will be required to use only visual renderers: aural or Braille renderers will not work at all, because they do not base their transforms on font styles found in rich-text documents. It will be impossible to name files in a filesystem (filenames have no capability to specify a required font name). Your system is therefore NOT accessible.

If you don't remember what "mojibake" is, look the term up in a search engine. Mojibake will immediately reappear. That problem, which exploded in the 1980s, was largely due to the need for interoperability across operating systems. With the development of the UCS and its now almost universal adoption, mojibake is a disappearing problem (but it is a long process, because there are still tons of legacy applications based on legacy SBCS/DBCS encodings, some of them not endorsed by any national or international published standard, or defined in proprietary systems and not publicly documented at all).

> I have traveled quite a bit
> in the IT world. Don't be surprised if it is more than what you've seen.

Visibly not. To begin with, you do not even know what the ISO 8859-1 standard (the very standard you want to break) actually says, when it has strictly no problem in relation to the newer UCS system.

>> My solution is supported by two standards: ISO-8859-1 and Open Type.
>> ISO-8859-1 is Basic Latin plus Latin-1 Extension part of Unicode standard.
>> It is not supported by ISO-8859-1. ISO-8859-1 is for Latin letters, not
>> Sinhalese ones.

NO.
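Mojibake is trivially reproduced by decoding bytes through the wrong mapping; a short Python sketch (the sample text is mine):

```python
# The same bytes, read through two different encodings: correct text
# with the right one, garbage (mojibake) with the wrong one.
data = "caf\u00E9".encode("utf-8")      # b'caf\xc3\xa9'
print(data.decode("utf-8"))             # café   (correct)
print(data.decode("latin-1"))           # cafÃ©  (mojibake)
```

This is exactly the class of failure that a font-dependent "Latin-1 Sinhala" reintroduces: the bytes only make sense under one undeclared, out-of-band interpretation.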
NO NO. ISO 8859-1 is definitely NOT usable for encoding any Sinhalese letters or the Sinhalese script. Maybe it is usable within a romanization of the Sinhalese language, but that is an orthogonal problem on which Unicode neither mandates nor standardizes anything.

>> Bottom line is this: If Latin-1 is good enough for English and French, it
>> is good enough for Singhala too.
>>
>> No, because Sinhala is not written with Latin letters.

> Declarations like that won't work in a technical discussion. You need to
> explain. Singhala is a language. Singhala native SCRIPT is the traditional
> way it is written.

And everywhere you attempt to mix the concept of the language with the concept of the script and writing system. This stops your arguments immediately.

> When I write Jean I really entered the four code points:
> 74 101 97 and 110. When you write naena, you enter 110 97 101 110 and 97. We
> think the former is a name of a pretty girl and the latter is a name I made
> up not in a particular language.

These are not code points. Start reading the standard (notably Chapter 3) to understand the concepts.

>> And if Open Type is good for English and French, it is good for Singhala
>> too.
>>
>> Of course.

> Thank you for that.

But this does not prove anything about all the rest you said. OpenType can work for the Sinhalese script, provided that you do not use it to infer ligatures of the Sinhalese script or writing system from an encoding of Latin letters (in ISO 8859-1 or any other binary encoding mapped to the UCS). OpenType already works with the Sinhalese script using the UCS-encoded characters of the Sinhalese script, without even needing any intermediate romanization system (which is always lossy...).
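As an aside on the numbers quoted above: 74 101 97 110 are the byte values of "Jean" in ASCII/Latin-1. For that range the numbers happen to coincide with the UCS scalar values of the letters (that coincidence is precisely the Latin-1-to-UCS mapping), but bytes and code points remain distinct concepts. A quick Python check:

```python
# Byte values (one particular encoding) vs. the letters' scalar values:
# for ASCII-range text the numbers coincide, but the concepts differ.
assert list("Jean".encode("ascii")) == [74, 101, 97, 110]
assert [ord(c) for c in "naena"] == [110, 97, 101, 110, 97]
```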

