On 07/10/2003 05:29, Marco Cimarosti wrote:

Peter Kirk wrote:


For i% = 1 to Len(utf8string$)
   c$ = Mid(utf8string$, i%, 1)
   Process c$
Next i%

Such a loop would be more efficient in UTF-32, of course, but it still shows a real need to work with character counts.



If the string type and functions of this Basic dialect are not Unicode-aware, then:

- Len(s$) returns the number of *bytes* in the string;

- Mid(s$, i%, 1) returns a single *byte*;

- Your Process() subroutine won't work...
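
The three points above can be sketched roughly as follows. This is Python rather than Basic, since the dialect in question is unspecified; `bytes` stands in for a string type that is not Unicode-aware, `str` is shown only for contrast, and `process` and the sample text are hypothetical stand-ins, not anything from the original loop.

```python
# Hypothetical sample: three characters (a, e-acute, euro sign),
# which occupy 1 + 2 + 3 = 6 bytes in UTF-8.
text = "a\u00e9\u20ac"
raw = text.encode("utf-8")


def process(c: bytes) -> str:
    """Hypothetical stand-in for Process: expects one whole character."""
    return c.decode("utf-8")


# Byte-oriented functions: the length is a byte count, and each
# one-unit slice (the analogue of Mid(s$, i%, 1)) is a single byte.
byte_len = len(raw)
pieces = [raw[i:i + 1] for i in range(byte_len)]

# Feeding those single bytes to process() fails whenever the byte
# is only part of a multi-byte character.
broken = 0
for piece in pieces:
    try:
        process(piece)
    except UnicodeDecodeError:
        broken += 1

# Unicode-aware functions, for contrast: length and slicing count
# characters, and the internal representation is not visible.
char_len = len(text)
chars = [text[i:i + 1] for i in range(char_len)]
```

Only the ASCII letter survives the byte-wise loop; every other slice is a fragment of a character.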

If the string type and functions are Unicode aware (as, e.g., in Visual
Basic or VBScript), then I'd expect that the actual internal representation
is hidden from the programmer, hence it makes no sense to talk about an
"UTF-8 string".

_ Marco







You are correct, of course. I was assuming a Unicode-aware dialect of Basic. But my variable names are no more guaranteed to be meaningful and appropriate than are Unicode character names ;-) ; they are only required to be distinct.

I could imagine a dialect of Basic which had separate string handling functions for UTF-8 bytes and for characters. This is how the Unicode-aware version of the SIL Consistent Changes stream editor works; see http://www.sil.org/computing/catalog/show_software.asp?id=4.
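
As a rough illustration of what such a pair of function families might look like, here is a minimal Python sketch over a UTF-8 byte buffer. The names `byte_len`, `char_len` and `char_mid` are invented for this example and are not taken from any Basic dialect or from Consistent Changes; the code also assumes the input is valid UTF-8.

```python
def _is_lead(b: int) -> bool:
    """True unless b is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) != 0x80


def byte_len(s: bytes) -> int:
    """Length in bytes, like a byte-oriented Len."""
    return len(s)


def char_len(s: bytes) -> int:
    """Length in characters: count every byte that starts a character."""
    return sum(1 for b in s if _is_lead(b))


def char_mid(s: bytes, start: int, count: int = 1) -> bytes:
    """Character-oriented Mid with a 1-based start, as in Basic.

    Returns the UTF-8 bytes of up to `count` characters beginning at
    character number `start`, or b"" if `start` is out of range.
    """
    # Byte offset of each character, plus a sentinel for the end.
    offsets = [i for i, b in enumerate(s) if _is_lead(b)]
    offsets.append(len(s))
    n_chars = len(offsets) - 1
    if start < 1 or start > n_chars or count < 1:
        return b""
    end = min(start - 1 + count, n_chars)
    return s[offsets[start - 1]:offsets[end]]


# The loop from the start of the thread, in character terms.
utf8string = "a\u00e9\u20ac".encode("utf-8")
seen = [char_mid(utf8string, i) for i in range(1, char_len(utf8string) + 1)]
```

Each `char_mid` call rescans the buffer from the beginning to find character boundaries, which is the inefficiency relative to UTF-32 mentioned earlier in the thread.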

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




