On 11/12/2003 17:55, Philippe Verdy wrote:

Peter Kirk wrote:


I am sure that some tricks could be found to simplify the indexing if necessary, e.g. using PUA or non-character code points indexed into a special table to replace DGCs which cannot be represented as a single character. (There are plenty of non-characters available as you need to use UTF-32 here to avoid exactly the same problems with surrogates.)



You're quite optimistic here: the total number of DGCs that can be encoded in Unicode goes far beyond the capacity of PUAs and even of the whole Unicode range itself.

I did not try to count them for the simplest cases, but possible DGCs are
nearly infinite:
- there's no upper limit for the number of diacritics you can combine with a
base character
- there's no limit on the number of base characters that can be used to
build Hangul syllables.


More than that, actually infinite, as any one diacritic may be repeated.
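To make the point about repetition concrete, here is a minimal Python sketch. The helper `is_simple_cluster` is hypothetical and only a rough stand-in for real grapheme-cluster segmentation: it treats one base character followed by combining marks as a single DGC.

```python
import unicodedata

def is_simple_cluster(s: str) -> bool:
    """Rough test: one non-combining base followed only by combining marks.

    This is a simplification for illustration, not the full Unicode
    grapheme cluster rules.
    """
    return (
        len(s) >= 1
        and unicodedata.combining(s[0]) == 0
        and all(unicodedata.combining(c) != 0 for c in s[1:])
    )

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# "a" with 1, 2, ... 5 acute accents: each is a different string, and each
# still passes the rough single-cluster test. Nothing stops n from growing.
clusters = ["a" + ACUTE * n for n in range(1, 6)]
```

Every extra repetition of the accent yields a new, distinct cluster, so no fixed-size range of code points can enumerate them all in advance.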

So how will you allocate PUAs? Using an internal lookup table, stored with
the document that uses these PUAs, that translates only the DGCs used
internally into single PUAs? ...

Well, I wasn't actually thinking of storing these with the document, although I suppose they could be if I chose to store documents in a private format, an approach which I don't like. (This wouldn't even be an efficient format if I am mostly using UTF-32.) I was thinking rather of translating complex DGCs into PUAs etc. on input of each document individually, and keeping in memory a table mapping these PUAs to character strings. Actually it is probably better in this case to use non-characters, as there may be PUAs in the document already, and this avoids some of the problems you noted. As I have 65519 whole planes of non-characters available, which can support more than 4 billion distinct DGCs, I think I will have enough space for any practical document.
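A rough Python sketch of the idea, under stated assumptions: "non-characters" is read here as the 32-bit values above U+10FFFF (which is where the figure of 65519 planes comes from), clusters are segmented by the simplified rule "base character plus following combining marks", and the names `ClusterTable`, `encode` and `decode` are hypothetical. It is not a worked-out design.

```python
import unicodedata

# Assumption of this sketch: codes for multi-character DGCs start just
# past the last Unicode code point, in the unused 32-bit space.
FIRST_PRIVATE = 0x110000
SPARE_PLANES = (2**32 - FIRST_PRIVATE) // 0x10000  # 65536 - 17 planes

class ClusterTable:
    """In-memory table for one opened document; never stored with it."""

    def __init__(self):
        self._to_code = {}   # cluster string -> private 32-bit code
        self._to_text = []   # index (code - FIRST_PRIVATE) -> cluster string

    def _unit(self, cluster):
        # A cluster that is a single character keeps its own code point.
        if len(cluster) == 1:
            return ord(cluster)
        code = self._to_code.get(cluster)
        if code is None:
            code = FIRST_PRIVATE + len(self._to_text)
            self._to_code[cluster] = code
            self._to_text.append(cluster)
        return code

    def encode(self, text):
        """Return one 32-bit integer per (simplified) cluster."""
        # NFC first, so DGCs with a precomposed form need no table entry.
        text = unicodedata.normalize("NFC", text)
        units, cluster = [], ""
        for ch in text:
            if cluster and unicodedata.combining(ch) == 0:
                units.append(self._unit(cluster))
                cluster = ""
            cluster += ch
        if cluster:
            units.append(self._unit(cluster))
        return units

    def decode(self, units):
        """Expand private codes back into their character strings."""
        return "".join(
            chr(u) if u < FIRST_PRIVATE else self._to_text[u - FIRST_PRIVATE]
            for u in units
        )

table = ClusterTable()
units = table.encode("e\u0301q\u0301\u0301zq\u0301\u0301")
```

Here the "e" plus acute collapses to the precomposed U+00E9, while the "q" with two acutes has no precomposed form and receives one private code, reused at its second occurrence. Indexing `units` is then one integer per DGC, and the table only ever holds the clusters that actually occur in the document.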

... Now how will you implement indexing with these
private PUAs whose semantics change across documents? What is the
relevant scope for these PUAs?


The scope would be one instance of a document opened in an application. As for implementation details, that is for implementers to sort out. This was a tentative suggestion which I made in passing, not something which I had thought through in detail.

In the 19th century Charles Babbage wrote, concerning his prototype computers:

Propose to an Englishman any principle, or any instrument, however admirable, and you will observe that the whole effort of the English mind is directed to find a difficulty, a defect, or an impossibility in it.

I regret that we English may have exported this unfortunate trait.


--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




