2013/12/10 Keith J. Schultz <[email protected]>:
> Hi Philip,
>
> I will repeat: I do not know Vietnamese, so I cannot give you
> the utf-8 sequence for it. All I can say is that in utf-8 the singular
> letters will be encoded in multi-bytes whereas the English letters will
> be just one byte.
>
It has no relation to English; it is just because these characters have
code points below 128 (the ASCII range). In Czech some characters will be
encoded as one byte, some as two bytes. The character "s" may appear in
English, German, Czech, Hungarian, Spanish and many other languages. You
have not answered Philip's question: what is the utf-8 sequence that
distinguishes English "s" from Czech "s", from Vietnamese "s", from
Hungarian "s", etc.?

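To make the point concrete, here is a minimal Python sketch. It is only my
illustration and has nothing to do with XeTeX itself; the helper name
utf8_lengths is made up for this example. It shows that the bytes of "sang"
are the same whatever language the word is meant to be in, and that a Czech
word can mix one-byte and two-byte characters.

```python
# Illustrative sketch only: UTF-8 encodes code points, not languages.
# The helper name utf8_lengths is hypothetical, chosen for this example.

def utf8_lengths(text):
    """Return (character, number of UTF-8 bytes) pairs for text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

# "sang" spelled with plain Latin letters: the encoder has no notion of
# whether the word is English, German or Vietnamese.
encoded = "sang".encode("utf-8")
print(encoded.hex(" "))  # 73 61 6e 67

# Czech "strom" (tree) has no accents: every letter is one byte.
print(utf8_lengths("strom"))

# Czech "\u0161est" (six) starts with s-caron, U+0161, which needs two
# bytes; the remaining letters need one byte each.
print(utf8_lengths("\u0161est"))
```

Nothing in the byte sequence 73 61 6e 67 records a language, so any
language information has to come from outside the encoding.
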
> Now, I also mentioned that differentiating Western languages poses a
> different matter!
> "sang" in English and "sang" in German and Austrian cannot be singularly
> differentiated as to which language it belongs to! All Latin
> characters/letters.
> Now, if "sang" is true Vietnamese and not a latinized form, I stand
> corrected! Though I have a feeling it is latinized! If we are talking of
> the phonetic representation, then an analysis on text and beyond singular
> text level is required.
>
Yes, it is a true Vietnamese word. I do not know Vietnamese, I could only
verify it with Google Translate, but I know that Vietnamese uses the Latin
alphabet with accents. And of course, some words do not have accents. It
is the same in Czech: we also use accented characters, but many words do
not have them. For instance, "strom" in Czech has a different meaning
than "Strom" in German.

> It has been mentioned by others that there seems to be a lack of
> multi-lingual utf-8 editors (input methods); on the other side also,
> Xe(La)TeX's lack of implementation of properly handling the Unicode
> standard.
>
Unicode is not a typographic standard, and programs from the TeX world
deal with typography. If you want to achieve typographically good output,
you have to use language-specific rules, i.e. the languages must be
properly tagged. Once you tag the language, it will appear right in the
Xe(La)TeX output. If you are interested in Unicode only and not in
typography, why do you wish to use a typographic tool?

I can explain it another way. If you wish to connect two pieces of wood,
you can use either a nail or a screw. If you use a screw, you must first
make a hole and then screw the pieces together. However, if you do not
want to make a hole and want to use a hammer only, why do you bother with
a screw instead of using a nail?

> It is not the standard that is the problem, but the implementation of
> input and the implementation of the output method.
>
> True enough, Unicode is by far not finished and is still evolving, with
> all the caveats involved. Yet the problem here does not arise from the
> Unicode standard or utf-8 encoding/decoding being inadequate, but from
> its implementation.
> The culprit is not utf-8!
>
>
>
> On 09.12.2013 at 23:51, Philip Taylor <[email protected]> wrote:
>
>>
>>
>> Keith J. Schultz wrote:
>>> Hi Philip,
>>>
>>> 1) I do not know Vietnamese!
>>>
>>> 2) If I did, use of the proper BMP would give me the answer.
>>> As "sang" would be a sequence of singular octets, and Vietnamese
>>> would use multi-byte sequences!
>>>
>>> Case closed! Like I mentioned, there are often ways used to reduce the
>>> length of the multibyte sequences. In that case one has to know the
>>> process used to get the proper Unicode character code!
>>
>> It is not necessary to "know" a language in order to be able to
>> algorithmically determine in which language a particular stretch
>> of text is written, if such algorithmic determination is possible.
>> I do not "know" Hebrew, but even I know that "בית דין" is Hebrew
>> and that "你好" is not. What I do not know (and what I challenge
>> you to tell us) is whether "sang" is English or Vietnamese.
>>
>> You wrote: "for efficiency reasons, utf-8 strings are not properly
>> encoded and programs assume a particular language, to save space."
>>
>> I invited you to tell us (the XeTeX list members, that is) what
>> would be a "properly encoded utf-8 string" for the sequence
>> "sang" which would enable a computer algorithm to determine
>> whether that string was "sang" (Vietnamese) or "sang" (English).
>>
>> I am still hoping that you will be able to tell us what that
>> properly encoded utf-8 string is, rather than just metaphorically
>> waving your arms in the air while throwing around phrases such as
>> "proper BMP", "singular octets" and "multi-byte sequences".
>>
>> Philip Taylor
>>
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
>   http://tug.org/mailman/listinfo/xetex

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
