On May 17, 2005, at 3:06 PM, Dar Scott wrote:
You can convert to UTF8 and then work with the chunk expression for line and item and (maybe word).
I forgot to say: not char. A char in your script is a single byte, so the char chunk can split a multi-byte UTF8 character.
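In UTF8, every character in the ASCII range, including the characters the chunking syntax depends on (such as return, comma and space), is a single byte with the high bit zero. Every other character consists only of bytes with the high bit one. That means a byte inside a multi-byte character can never be mistaken for a syntax character.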
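The high-bit property can be checked outside Revolution. The following is a minimal Python sketch, used here only for illustration; the function name and the sample string are made up for this example and are not part of the original handler.

```python
def high_bits_ok(text):
    """Check the UTF8 high-bit rule for every character in text.

    ASCII characters must encode to one byte with the high bit clear;
    all other characters must encode to bytes that all have the high bit set.
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            if len(encoded) != 1 or encoded[0] & 0x80:
                return False
        elif any(b & 0x80 == 0 for b in encoded):
            return False
    return True


sample = "a 3, b 21, \u00fc"   # same text as the field example, with u-umlaut
print(high_bits_ok(sample))    # True
```

Because no byte of a multi-byte character falls in the ASCII range, splitting the byte string on a comma or return byte cannot cut a character in half.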
I made this little handler:
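on mouseUp
  -- the field's text as UTF16
  get the unicodeText of field "field"
  -- convert UTF16 to UTF8
  put uniDecode(it,"UTF8") into utf8Text
  -- h receives the UTF8 bytes as a hex string
  get binaryDecode("H*",utf8Text,h)
  put utf8Text & lf & h
end mouseUp

I put this into field "field": a 3, b 21, ü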
I clicked the button and got this:
a 3, b 21, ü
6120332c20622032312c20c3bc

Broken up by characters:
61 20 33 2c 20 62 20 32 31 2c 20 c3bc
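The per-character grouping above can be reproduced with a short Python sketch (illustrative only; `hex_by_character` is a hypothetical helper name, not a Revolution function). It encodes each character to UTF8 separately and prints its bytes in hex.

```python
def hex_by_character(text):
    """Return the UTF8 bytes of text in hex, one space-separated group per character."""
    return " ".join(ch.encode("utf-8").hex() for ch in text)


print(hex_by_character("a 3, b 21, \u00fc"))
# 61 20 33 2c 20 62 20 32 31 2c 20 c3bc
```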
As you can see, words just look longer because some characters are two to four bytes. You can spot them by their high bytes (c3 and bc here, both 80 hex or above).
If you get item 3, you get the right text.
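As a rough illustration of why item 3 comes out right, here is a Python sketch that splits the UTF8 bytes on the comma byte, the way an item chunk would. The helper `utf8_item` is hypothetical and only mimics 1-based item numbering; it is not how Revolution implements chunks.

```python
def utf8_item(utf8_bytes, n, delimiter=b","):
    """Return item n (1-based) of delimiter-separated UTF8 bytes."""
    return utf8_bytes.split(delimiter)[n - 1]


data = "a 3, b 21, \u00fc".encode("utf-8")
item3 = utf8_item(data, 3)
print(item3.hex())                   # 20c3bc (a space, then the two bytes of u-umlaut)
print(repr(item3.decode("utf-8")))   # ' ü'
```

The item keeps its leading space, and the two-byte character arrives whole because neither of its bytes can be a comma.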
You can then convert results back to UTF16 (host order).
Unfortunately, a BOM (byte order mark) at the start of the text can interfere with this, because it ends up inside the first line or item, so remove it when you convert to UTF8.
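A minimal Python sketch of the BOM problem and one way to remove it, assuming the BOM has already been carried over into the UTF8 bytes. `strip_utf8_bom` is a hypothetical helper; `codecs.BOM_UTF8` is the standard library constant for the three bytes EF BB BF.

```python
import codecs


def strip_utf8_bom(utf8_bytes):
    """Return utf8_bytes without a leading UTF8 BOM, if one is present."""
    if utf8_bytes.startswith(codecs.BOM_UTF8):
        return utf8_bytes[len(codecs.BOM_UTF8):]
    return utf8_bytes


with_bom = ("\ufeff" + "a 3, b 21").encode("utf-8")
print(with_bom.split(b",")[0].hex())                  # efbbbf612033 (BOM stuck to item 1)
print(strip_utf8_bom(with_bom).split(b",")[0].hex())  # 612033
```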
This will work with characters that require surrogates in UTF16, too, so this is a nice general solution.
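To illustrate the surrogate case, the Python sketch below takes one character outside the primary plane (U+1D11E, chosen arbitrarily for this example) and shows that it is a surrogate pair in UTF16 but an ordinary four-byte sequence in UTF8, so splitting on commas still works. The function name is hypothetical.

```python
def encodings(ch):
    """Return (UTF16 big-endian hex, UTF8 hex) for a single character."""
    return ch.encode("utf-16-be").hex(), ch.encode("utf-8").hex()


clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, needs surrogates in UTF16
print(encodings(clef))   # ('d834dd1e', 'f09d849e')

# Item splitting on the UTF8 bytes keeps the four-byte character intact.
items = ("a," + clef + ",b").encode("utf-8").split(b",")
print(items[1].decode("utf-8") == clef)   # True
```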
If you need to work mostly with chars, then leave the text in UTF16 and work with it two bytes at a time (each byte is a char in your script). You can very often assume that all your characters are in the primary plane, so there are no surrogates and every two bytes is one Unicode character.
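The two-bytes-at-a-time approach might look like the following Python sketch. It assumes the byte order is known (little-endian here, as an example) and that the text stays in the primary plane; it raises an error rather than guessing if a surrogate turns up. The function name and parameters are hypothetical.

```python
def utf16_units_primary_plane(utf16_bytes, byteorder="little"):
    """Read UTF16 bytes two at a time, assuming every unit is a whole character."""
    code_points = []
    for i in range(0, len(utf16_bytes), 2):
        unit = int.from_bytes(utf16_bytes[i:i + 2], byteorder)
        if 0xD800 <= unit <= 0xDFFF:
            # A surrogate means the primary-plane assumption does not hold.
            raise ValueError("surrogate found; text is not limited to the primary plane")
        code_points.append(unit)
    return code_points


data = "b \u00fc".encode("utf-16-le")
print([hex(cp) for cp in utf16_units_primary_plane(data)])   # ['0x62', '0x20', '0xfc']
```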
Dar
--
**********************************************
DSC (Dar Scott Consulting & Dar's Lab)
http://www.swcp.com/dsc/
Programming and software
**********************************************
