Peter Constable responded to Peter Kirk:

> > From: Peter Kirk [mailto:[EMAIL PROTECTED]
> > Sent: Friday, May 28, 2004 1:40 PM
> >
> > Well, I understood the semantic content of a text to be the meaning of
> > the words...
[Kirk continuing, to provide more context...

> > , not the indication of which script they are written in. ...
> > But a Hebrew or Moabite
> > word has the same meaning whether it is written with Hebrew or
> > Phoenician glyphs. That was my argument. Now you may wish to argue that
> > plain text is intended to convey more information than that, also the
> > information about what script it is written in, but again that begs the
> > question about the what is a script distinction.

]

Constable responded:

> Unicode encodes characters, not languages, not morphemes, not senses of
> words. The character semantics of "Sally" and of "Sally" transliterated
> into Hebrew are not the same.

This also struck me as a major misunderstanding in Peter Kirk's note,
which may underlie some of the problem this thread has been having in
coming to *any* conclusions whatsoever.

Take a look at page 343 of the Unicode Standard, which shows a line from
the Codex Argenteus in Gothic script. That line is then *transliterated*
into the Latin script, and a translation is also given.

Taking just the last word, we have the Gothic:

<10340, 10342, 10330, 1033F, 10346, 10334, 10344, 10330, 1033F>
PAIRTHRA, RAIDA, AHSA, URUS, FAIHU, AIHVUS, TEIWS, AHSA, URUS

and the Latin:

<0070, 0072, 0061, 0075, 0066, 0065, 0074, 0061, 0075>
p r a u f e t a u

Now *whichever* way this is represented, this is still the *same* Gothic
*language* word, and it means the same thing: prophet.

However, the *Unicode* sense of the semantics of these strings is
different. Unicode semantics refers to the identity of the encoded
characters. The semantics of U+10340 is the 17th letter of the Gothic
alphabet (of the Gothic script), named PAIRTHRA. The semantics of U+0070
is the 16th letter of the Latin (and English) alphabet (of the Latin
script), named 'pee' (or P).

The Unicode semantics of those two strings is distinct, regardless of the
fact that both represent the same word in the same language.
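To make the distinction between character identity and word identity concrete, here is a minimal Python sketch using the standard `unicodedata` module. The code points are the ones quoted above. The names `describe`, `transliterate`, and `GOTHIC_TO_LATIN` are illustrative assumptions, and the mapping table is built only from the letter pairing of this one word, so it is not a complete or authoritative Gothic transliteration scheme.

```python
import unicodedata

# Code points quoted in the message for the last word of the line.
GOTHIC = [0x10340, 0x10342, 0x10330, 0x1033F, 0x10346,
          0x10334, 0x10344, 0x10330, 0x1033F]
LATIN = [0x0070, 0x0072, 0x0061, 0x0075, 0x0066,
         0x0065, 0x0074, 0x0061, 0x0075]

gothic_word = "".join(chr(cp) for cp in GOTHIC)
latin_word = "".join(chr(cp) for cp in LATIN)


def describe(text):
    """Return (code point, character name) pairs, one per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Hypothetical letter-by-letter table, derived only from the pairing above.
GOTHIC_TO_LATIN = dict(zip(gothic_word, latin_word))


def transliterate(text):
    """Map each Gothic letter to its Latin counterpart; pass others through."""
    return "".join(GOTHIC_TO_LATIN.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for cp, name in describe(gothic_word):
        print(cp, name)
    for cp, name in describe(latin_word):
        print(cp, name)
    # The encoded strings differ: they are different characters.
    print(gothic_word == latin_word)
    # A process that knows the scripts can still establish an equivalence.
    print(transliterate(gothic_word) == latin_word)
```

The plain string comparison sees two unrelated character sequences. Only the transliteration step, which carries knowledge about both scripts, relates them.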
Conformance to the Unicode Standard requires that processes respect the
(Unicode) semantics of such strings. That means that if you are handed
<10340, 10342, 10330, 1033F, ...> you recognize that this is a sequence
of characters in the Gothic script as encoded in the standard -- not
Devanagari or Hangul, for example, or OCR symbols. If handed
<0070, 0072, 0061, ...> you must recognize that this is a sequence of
characters in the Latin script as encoded in the standard -- not
Devanagari or Hangul, or OCR symbols, or, for that matter, Gothic.

However, conformance to the Unicode Standard does not prevent a process
which is *aware* of the meaning of Gothic text, either in some relatively
simple and straightforward way (e.g. a transliterator) or in some deep
and profound way (e.g. a machine translator) from determining that there
is an *equivalence* to be made here -- in the first instance a
letter-by-letter equivalence between the two scripts, and in the second
instance a lexical equivalence between the words represented and their
meanings.

Now I suspect that the Semitic palaeographers in this discussion are
going to raise their eyebrows and assert that this whole concept of
"semantics" for the characters is tautologous and meaningless. In essence
an encoded character has a distinct semantics in the Unicode Standard
only and precisely *because* it is encoded separately as a character. And
the exceptions are asserted to be exceptions by specification of
*canonical equivalence*, which equates the semantics of either distinct
sequences or those few instances where the committee has effectively
determined that the *same* character was encoded more than once in the
standard (for various historical reasons).

Nevertheless, that is the way the standard works.
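
As an illustration of canonical equivalence, the following Python sketch compares strings after canonical decomposition (NFD) with the standard `unicodedata.normalize` function. The helper name `canonically_equivalent` and the sample characters are my own choices for illustration, not examples taken from the thread: U+00E9 versus <0065, 0301> stands in for "distinct sequences", and U+212B ANGSTROM SIGN versus U+00C5 stands in for a character that was in effect encoded twice.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Distinct sequences that the standard declares equivalent.
composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# A character treated as a duplicate encoding of another.
angstrom = "\u212B"      # ANGSTROM SIGN
a_ring = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE

# Separately encoded letters with no equivalence between them.
gothic_p = "\U00010340"  # GOTHIC LETTER PAIRTHRA
latin_p = "p"            # LATIN SMALL LETTER P

if __name__ == "__main__":
    print(canonically_equivalent(composed, decomposed))
    print(canonically_equivalent(angstrom, a_ring))
    print(canonically_equivalent(gothic_p, latin_p))
```

Under this check the first two pairs compare as equivalent and the Gothic/Latin pair does not, which is the sense in which the two letters carry distinct Unicode semantics.
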
It is, in fact, the way *all* character encoding standards work -- the
nature of the issue is simply more profoundly obvious for the Unicode
Standard because of its intended universal scope, which means it dabbles
in dozens of scripts that no other character encoding standard has ever
attempted to come to grips with.

Now the architectural issue for the encoding of the Gothic script in the
Unicode Standard is very closely analogous to the situation that bears on
the question of the encoding of the Phoenician (~ Old Canaanite) script.
The *need* and prospective benefits for encoding Gothic as a script
distinct from Latin are roughly parallel to those suggested for
Phoenician. The prospective costs for scholars involved in the study of
Gothic text are roughly parallel to those raised by the Semiticists: the
need to fold any Gothic text to the more usual Latin transliterations,
when encountered or when searching.

If this parallel is not apparent to people, then I submit that you may
not really understand the Unicode Standard, its intent, or how the
committees approach their encoding tasks.

And no matter how many times Peter Kirk begs the question of what is a
script distinction, what it comes down to in the Unicode Standard is that
a script distinction is a distinct encoding of a script, neither more nor
less. It does not correlate directly to a graphologist's or
palaeographer's definition (if they have one) of what a script is, nor
can it be defined, a priori, axiomatically. It comes down to decisions
about potential usefulness of separate encoding of certain candidate
collections of related writing symbols, based on historical identity,
technical considerations of how various desired processes may interact
with the encoding choices, and input from (sometimes competing)
interested parties who may or may not want a separate encoding for some
entity, based on the way they have traditionally interacted with
materials of relevance.

--Ken

