Stefan Persson wrote as follows (text >), responding to Andrew C. West (text >>).
>> Personally I think that markup may be more appropriate, given the countless possible permutations of
>> combining/superscript letters that may be encountered in mediaeval texts in various languages.
>
> Why not just add *two* characters, either to the PUA or to Unicode?
>
> U+XXXX = COMBINING LETTER ABOVE INDICATOR
> U+XXXY = SUPERSCRIPT LETTER INDICATOR
>
> This means that U+XXXX directly followed by "a" is a combining "a" above, and that U+XXXY directly followed by "a" is a superscript "a".
>
> This means some normalisation issues:
>
> U+0061 U+0363 ≡ U+0061 U+XXXX U+0061
> U+00AA ≡ U+XXXY U+0061
>
> etc.
>
> Stefan

Well, such normalisation could be as private a matter as the allocation of the two characters to the Private Use Area.

Consider please the following scenario, which I have devised in a creative writing manner as fiction, yet which does not seem unrealistic in relation to what might happen in practice, somewhere, sometime.

Suppose please that someone wishes to transcribe the text of a medieval manuscript so as to have the text stored in a computerised format. Upon finding various characters in the manuscript which cannot be entered as Unicode characters, he or she might reasonably devise his or her own encoding list, by, say, making a handwritten list (with a view to later putting the piece of paper through a scanner to produce a graphic file), and use that encoding list in order to decide which characters to key into the computer system, perhaps doing the keying with a program such as UniPad.

The UniPad website is as follows.

http://www.unipad.org

It may be that the UniPad program could be customised so as to have a special soft keyboard to help the transcriber in keying the codes, yet even if that is not possible the Private Use Area codes could be entered using the character map which UniPad provides.
In such circumstances the transcriber could decide to have a Private Use Area encoding of the characters of the manuscript on the basis of one Private Use Area code point for each character in the manuscript. Alternatively, he or she could decide to have a system which used the two operators which you suggest, together with zero or more other operators and zero or more individual characters, depending upon the repertoire of characters which exist in the manuscript.

Certainly there are then issues of using the data once it is in a computer file. Maybe some special program will need to be written (such as a small Pascal program; I do not mean some major development project, just something which will do what is required for the particular transcription project). Yet for someone to use two such Private Use Area encodings in order to facilitate the task of getting the information content accurately from the document into the computer seems a perfectly reasonable thing to do.

The transcriber might need to do the transcribing of the original document during certain daytime hours at a table in a secure library environment, during a time frame arranged by prior appointment and permissions. Once the transcribed data is in the computer, either keyed in while in the library or transcribed from notes made using a pencil, the transcriber and other interested people throughout the world can analyse the meaning of the text of the document almost anywhere.
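As a rough illustration of how small such a program could be, here is a minimal sketch in Python (rather than Pascal) which replaces an indicator followed by a letter with the regular Unicode character, wherever one exists, following the equivalences in Stefan's message. The code points U+E000 and U+E001 are placeholders chosen only for this example, as XXXX and XXXY have not been decided; the function name and the two tables are likewise hypothetical, and the tables cover only a few letters.

```python
# Placeholder Private Use Area code points, assumed for illustration only.
# The post leaves XXXX and XXXY undecided.
COMBINING_ABOVE_INDICATOR = "\uE000"  # stands in for U+XXXX
SUPERSCRIPT_INDICATOR = "\uE001"      # stands in for U+XXXY

# Regular Unicode equivalents for a few letters (deliberately not exhaustive).
COMBINING_ABOVE = {
    "a": "\u0363",  # COMBINING LATIN SMALL LETTER A
    "e": "\u0364",  # COMBINING LATIN SMALL LETTER E
    "o": "\u0366",  # COMBINING LATIN SMALL LETTER O
}
SUPERSCRIPT = {
    "a": "\u00AA",  # FEMININE ORDINAL INDICATOR
    "o": "\u00BA",  # MASCULINE ORDINAL INDICATOR
}

TABLES = {
    COMBINING_ABOVE_INDICATOR: COMBINING_ABOVE,
    SUPERSCRIPT_INDICATOR: SUPERSCRIPT,
}


def to_standard(text):
    """Replace indicator+letter pairs by regular Unicode characters where known.

    Pairs with no entry in the tables are left exactly as they were keyed,
    so no information from the transcription is lost.
    """
    out = []
    i = 0
    while i < len(text):
        table = TABLES.get(text[i])
        if table is not None and i + 1 < len(text) and text[i + 1] in table:
            out.append(table[text[i + 1]])
            i += 2  # consume the indicator and the letter together
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With these placeholder values, `to_standard("a\uE000a")` gives U+0061 U+0363 and `to_standard("\uE001a")` gives U+00AA, while a pair such as the indicator followed by "z" passes through unchanged.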
For people trying to understand such documents, the two Private Use Area codes could be used together with an ordinary TrueType font in which U+XXXX is implemented as a glyph of an arrow which starts by going straight upwards and then goes steeply diagonally upwards in a bend dexter direction until it reaches the point of the arrow (as if the back half of the arrow were as in U+2191 and the front half of the arrow were as in U+2196), and in which U+XXXY is implemented as an arrow going straight upwards until it reaches the point of the arrow (similar to U+2191). That would give researchers a convenient way of having a look at the transcribed text of the document. I only suggest those particular glyphs as examples in this posting; please feel free to use whatever glyph designs you wish.

Certainly, such Private Use Area codes would only have any validity amongst a group of users of the Unicode system who had agreed to use those particular Private Use Area encodings with those meanings. Yet such a Private Use Area encoding could, I feel, be very useful amongst such a group of researchers, in that it would get the document transcription job done. It would also have the considerable advantage that, if the transcribed file were displayed in a program such as WordPad or Word, one would only need a Unicode font augmented with two arrow glyphs at the appropriate code points in order to understand an indication that, in the original document, some regular Unicode character was combined above another regular Unicode character or was superscripted.
Well, why not go ahead and decide on two code points within the Private Use Area as values for XXXX and XXXY and post them in this list? Perhaps that action will lead to the facility becoming available to document transcribers all around the world. If the code points were published in this manner, maybe a font and maybe a UniPad soft keypad which use those code points will become available in time, and so researchers transcribing documents in libraries around the world would have a lasting enhancement of the facilities available to them.

This method would not produce a visually correct display, yet in order to convey meaning in a research environment it could help in getting the transcribing done and thus would be a valuable addition to the facilities available.

William Overington

9 August 2002

