Hello,


I would like to clarify a point which, I guess, relates more to the UCS 
character set than to Unicode strictly speaking.

Say I have a sequence of codes, each representing a UCS "abstract character", 
the sequence as a whole being the representation of a text. The text is the 
French word "âme" (soul). Then the sequence may be either (61 302 6D 65) or 
(E2 6D 65), the latter using a precomposed 'â'. Both are valid and both 
represent the same text. Correct?
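To make the question concrete, here is a small sketch in Python 3 using only the standard unicodedata module. It assumes that "represent the same text" means canonical equivalence, i.e. that both sequences become identical once normalized to a common form; the variable names are mine, chosen for illustration.

```python
import unicodedata

precomposed = "\u00E2me"   # 'â' as a single code
decomposed = "a\u0302me"   # 'a' followed by COMBINING CIRCUMFLEX ACCENT

# The raw code sequences are different...
raw_equal = precomposed == decomposed

# ...but they coincide once both are brought to the same normalization form.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

# Show the codes of each sequence in hexadecimal.
codes_precomposed = ["%X" % ord(c) for c in precomposed]
codes_decomposed = ["%X" % ord(c) for c in decomposed]

print(codes_precomposed, codes_decomposed)
print(raw_equal, nfc_equal, nfd_equal)
```

If this reading is right, a plain comparison of the two sequences says "different" while a comparison after normalization says "same".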

How is a function supposed to return the first "true character", in the common 
sense, meaning what is sometimes called a "grapheme" in Unicode literature? From 
the first sequence, a naive routine returning the first code would wrongly 
return 'a'. Without a rather sophisticated analysis of UCS code combinations 
(itself requiring knowledge of the world's writing systems), there is no chance 
for such a routine to automagically combine codes. Correct?
Returning 'a' from the first sequence makes no sense, since it is only a part 
of a composite character (letter). [Actually, it may make sense, e.g. for a 
linguistic app counting occurrences of 'a'-based characters, but this is a 
rather specific need.] From the second sequence, the function would return 
'â', which is correct; but only by chance, so to say, just because in this 
case the code happens to represent a whole character (grapheme). What can we 
do with a function returning isolated codes that, in the general case, 
represent only parts of whole characters (what I call "marks")? Since one 
cannot guess whether a lone code represented part or all of a grapheme in the 
original text, my answer is: nothing. Am I right on this?
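As a rough illustration of what such a routine might look like, here is a Python 3 sketch that returns the first code together with any combining marks following it. The function name first_grapheme is hypothetical, and the rule used (attach every following code whose canonical combining class is non-zero) is a deliberate simplification: it does not cover the full grapheme cluster rules of Unicode (Hangul jamo, joiners, marks of class 0, and so on).

```python
import unicodedata

def first_grapheme(text):
    """Return the first code plus the combining marks that directly follow it.

    Simplified assumption: a code belongs to the preceding base when its
    canonical combining class is non-zero.
    """
    if not text:
        return ""
    end = 1
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end]

first_of_decomposed = first_grapheme("a\u0302me")   # base 'a' + circumflex
first_of_precomposed = first_grapheme("\u00E2me")   # single precomposed code
print(len(first_of_decomposed), len(first_of_precomposed))
```

With this sketch both sequences yield a whole 'â', as two codes in one case and one code in the other, so the two results still compare equal only after normalization.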

Now, say a function returns the (first) position of a given grapheme, else 
fails. If I use it to search for 'a' and the 'â' is decomposed, it will wrongly 
find 'a' as the first character. Correct?
If I use it to search for 'â', a rather interesting situation emerges, I 
guess. If 'â' happens to be expressed in the same form (decomposed or 
precomposed in both cases) in the input text and in the source code, then the 
function finds it; else it fails. Strange, no? What do you think? It means, if 
I am right, that to run such a routine we need to produce canonical forms of 
both the input and the parameter. Then we can safely compare expressions, so 
to say, in the same language. Correct?
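Here is a minimal Python 3 sketch of that idea, again using the standard unicodedata module. The name find_normalized is hypothetical; the function simply normalizes both the text and the searched grapheme to the same form (NFC by default) before calling str.find. Note that the position it returns refers to the normalized text, not to the original sequence.

```python
import unicodedata

def find_normalized(text, pattern, form="NFC"):
    """Search pattern in text after bringing both to the same normal form.

    The returned index is a position in the *normalized* text (or -1).
    """
    text_n = unicodedata.normalize(form, text)
    pattern_n = unicodedata.normalize(form, pattern)
    return text_n.find(pattern_n)

t1 = "\u00E2me"    # precomposed 'â'
t2 = "a\u0302me"   # decomposed 'â'

# 'â' is found whatever form the text and the parameter happen to use.
results = [
    find_normalized(t1, "\u00E2"), find_normalized(t2, "\u00E2"),
    find_normalized(t1, "a\u0302"), find_normalized(t2, "a\u0302"),
]
# With NFC, a bare 'a' no longer matches inside the decomposed 'â'.
bare_a = find_normalized(t2, "a")
print(results, bare_a)
```

The last point only holds here because a precomposed 'â' exists; for a base plus mark combination that has no precomposed code, NFC leaves the codes separate and a search for the bare base would still match, so a grapheme-aware comparison would be needed on top of normalization.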

If you use Python, see below a little script illustrating the issues.

Any comment, criticism, or pointer is welcome. Thank you,
Denis

==========================================================================
# -*- coding: utf-8 -*-
# (assumes this file is saved as UTF-8 and the literal 'â' below is precomposed)
t1 = u"\u00E2me"          # "âme", using precomposed character form for 'â'
t2 = u"\u0061\u0302me"    # "âme", using decomposed character form for 'â'
# the parenthesized print runs under both Python 2 and Python 3
print("%s %s\t%s %s\t%s %s\t%s %s\t%s %s\t%s %s\t%s %s" % (
    t1 , t2 ,
    len(t1) , len(t2) ,
    t1[0],t2[0] ,
    t1.find(u'a') , t2.find(u'a') ,
    t1.find(u'â') , t2.find(u'â') ,
    t1.find(u"\u00E2") , t2.find(u"\u00E2") ,
    t1.find(u"\u0061\u0302") , t2.find(u"\u0061\u0302") ,
    ))
# -->    âme âme    3 4    â a    -1 0    0 -1    0 -1    -1 0

-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com


