Peter Constable wrote at 1:00 PM on Thursday, May 6, 2004:

>Dean:
>
>> Here are the polar choices for XML:
>>
>> TAGGED (but not encoded)...
>
>> ENCODED (but not tagged)...
>
>Note that tagging can be used as well as distinct encoding.
Of course - that's one reason why I said "polar choices".

>> The tagged version is not a "font minefield". On the contrary, it
>> explicitly provides an international standard mechanism for a level of
>> specification and refinement not possible via encoding. You can, for
>> example, do things like: <Phn subscript="Punic" locus="Malta"
>> font="Maltese Falcon">BT 'LM</Phn>. In fact, this is precisely the sort
>> of thing for which XML was designed.
>
>It *is* a minefield, because the correct interpretation of the text is
>dependent on particular fonts being on the recipients' systems. That
>fails the criterion of plain text legibility.

Of course. But that does not make tagged text a minefield - in the
absence of your nice Phoenician font, Hebrew would show up instead,
which is precisely what is used, by and large, by Semiticists right now.

>> The untagged, but differently encoded version, on the other hand, IS a
>> search and text processing quagmire, especially when confronted by the
>> possibility of having to deal with multiplied West Semitic encodings,
>> e.g., for the various Aramaic "scripts" and Samaritan.
>
>Again, I find I have to disagree. It is much easier in searching to
>neutralize a distinction than to infer one. And, as has been stated, if
>there are distinct encodings, a given researcher can still use common
>indexing for their data if that suits their purpose.

I like the way Mark Davis put it (he uses the word "nightmare" for
processing over-deunified text):

Mark Davis wrote at 8:22 PM on Monday, May 3, 2004:

>- There is a cost to deunification. To take an extreme case, suppose that we
>deunified Rustics, Roman Uncials, Irish Half-Uncial, Carolingian Minuscule,
>Textura, Fraktur, Humanist, Chancery (Italic), and English Roundhand. All
>often very different shapes. Searching/processing Latin text would be a
>nightmare.
>
>- There is also a cost to unification.
>To take an extreme case, suppose we
>unified Latin, Greek, Cyrillic, Arabic, and Hebrew (after all, they have a
>common ancestor). Again, nightmare.
>
>So there is always a balance that we have to strike, looking at each
>situation carefully and assessing a number of different factors.

This is ALL I am trying to do here - just presenting some perspectives
that may not be apparent to non-specialists, in the hope that it will
make for a better informed decision.

>As has been stated, the distinct needs of two communities can be served
>well with two encodings; it is much more difficult to serve the distinct
>needs of a second group if the distinct things they want are merged into
>what the first group uses.

The problem is that you are seeing this as "two encodings" for "two
communities". This does not represent the ground reality for West
Semitic researchers, who have to deal with many "encodings" for many
communities.

Here is just ONE simple example of the kinds of problems we will be
confronted with if we start deunifying Northwest Semitic scripts:

As I've stated earlier, I (and others) clearly recognize a milestone
shift between the pre-exilic Old Hebrew "script" (based on Old
Canaanite) and the post-exilic Jewish Hebrew "script" (based on Official
Aramaic, which, in turn, was based on Old Canaanite). This is a very
clear-cut script shift implemented by Jewish scribes at the time -
almost perfectly analogous to the Fraktur-to-modern German script shift.

If we deunify Old Canaanite/Phoenician from Hebrew, we will be faced
with a dilemma. In the Dead Sea Scrolls, in the same "library", there
are some Biblical manuscripts written in Old Hebrew and some written in
Jewish Hebrew, with still others written in Jewish Hebrew with Old
Hebrew embedded in them. Clearly these scribes viewed Old Hebrew as a
conservative, archaizing diascript of Jewish Hebrew, or conversely,
Jewish Hebrew as a modern counterpart of Old Hebrew.
(That this was not merely the retention of old, perhaps somewhat
illegible, manuscripts by trained scribes is shown by the fact that BOTH
diascripts were used in CONTEMPORARY documents.)

If we have two applicable encodings available, will we use both or just
one of them for these texts? If we use both, text processing just became
more complicated. If we use one, we are ignoring an encoding made
explicitly available for one of the diascripts. But what is worse, if
somebody else has different practices than we do (and they WILL), text
processing has just become a "minefield" for everybody.

To me, this appears to be EXACTLY parallel to the use of Fraktur and
Roman in German: were these diascripts deunified, we would have the same
text processing problems in Second Temple Hebrew as we would have in
German, were Fraktur and Roman deunified. Clearly, unlike Mark
Shoulson's experiments with modern Hebrew readers, Second Temple Hebrew
readers read BOTH diascripts side by side. And we, who do research in
this period, try to put ourselves in their sandals.

>> Obviously there is a need, in many cases, to maintain the distinction
>> between the various diascripts; the question is where should that
>> distinction be introduced - at the encoding level or higher? ...
>
>> But, what I'm afraid of with this proposal, as I've stated before, is
>> that its adoption will set a precedent that will result in a
>> snowballing of West Semitic encodings,
>
>All I have said is that I'm persuaded that something distinct should be
>encoded -- at the character encoding level, not in markup.

But WHY? We need EXPLICIT reasons to justify a new encoding. Just saying
that somebody wants it in XML because their font won't show up is
insufficient justification, especially when the repercussions in the
scholarly communities who actually use this stuff could be disruptive.
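To make the search question concrete: "neutralizing a distinction" between
two separately encoded diascripts amounts to a simple fold table applied
before matching, much like case folding. A minimal Python sketch, assuming
(purely for illustration) a hypothetical 22-letter Phoenician block at
U+10900ff. folded onto the corresponding Hebrew base letters; the code
points and mapping here are this editor's assumptions, not part of any
encoding proposal in this thread:

```python
# Hypothetical fold: 22 Phoenician letters (assumed at U+10900..U+10915,
# in traditional abjad order) mapped to the 22 Hebrew base letters
# (final forms excluded). Characters outside the table pass through.
HEBREW = "\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8\u05D9" \
         "\u05DB\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6\u05E7\u05E8" \
         "\u05E9\u05EA"
PHOENICIAN = [chr(0x10900 + i) for i in range(22)]
FOLD = {ord(p): h for p, h in zip(PHOENICIAN, HEBREW)}

def fold_script(text: str) -> str:
    """Neutralize the (hypothetical) Phoenician/Hebrew distinction
    so that one search hits both diascripts."""
    return text.translate(FOLD)
```

Inferring the distinction in the other direction - deciding from unified
text which diascript a passage is written in - cannot be done by any such
table; it needs markup or external context, which is the asymmetry the
quoted remark points at.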
>> * Separately encode Phoenician, Old Hebrew, Samaritan, Archaic Greek, Old
>> Aramaic, Official Aramaic, Hatran, Nisan, Armazic, Elymaic, Palmyrene,
>> Mandaic, Jewish Aramaic, Nabataean ...
>
>I don't think anybody is looking for that many distinctions to be made.

I certainly hope not.

Respectfully,

Dean A. Snyder
Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218
office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digitalhammurabi