Re: Unicode and chunk expressions

Dar Scott Tue, 17 May 2005 14:52:52 -0700


On May 17, 2005, at 3:06 PM, Dar Scott wrote:

You can convert to UTF8 and then work with chunk expressions for line and item (and maybe word).


I forgot to say: not char. A char chunk of UTF8 text is a single byte, not a whole character.

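In UTF8, every character in the ASCII range, including the chunk delimiters (space, comma, line feed), is a single byte with the high bit zero. Every other character is encoded as two or more bytes, all with the high bit one. That means no byte inside a multibyte character can be mistaken for a delimiter, so you can't get any false syntax characters.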
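
The high-bit property can be checked outside Revolution. Here is a minimal sketch in Python (not Transcript), using the sample text from this message. The variable names are only for illustration, and U+012C is an extra example chosen because its UTF16 form happens to contain the comma byte.

```python
# Illustration only: Python, not Transcript. All names are hypothetical.
text = "a 3, b 21, \u00fc"          # the sample text; \u00fc is u-umlaut
data = text.encode("utf-8")

# Bytes with the high bit clear (below 0x80) are the ASCII characters.
ascii_bytes = [b for b in data if b < 0x80]
# Bytes with the high bit set belong to multibyte characters.
high_bytes = [b for b in data if b >= 0x80]

# Splitting the raw bytes on the comma byte (0x2C) cannot cut a character,
# because 0x2C never appears inside a multibyte UTF-8 sequence.
items = [part.decode("utf-8") for part in data.split(b",")]

# Contrast: U+012C encoded as UTF-16 contains the byte 0x2C, so a
# byte-oriented item split on UTF-16 data would see a false comma.
false_comma_in_utf16 = b"," in "\u012c".encode("utf-16-be")
false_comma_in_utf8 = b"," in "\u012c".encode("utf-8")
```

Only the UTF16 check finds a false comma; the UTF8 bytes split cleanly into three items.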

I made this little handler:

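on mouseUp
  -- the field's contents as UTF16 in host byte order
  get the unicodeText of field "field"
  -- convert UTF16 to UTF8
  put uniDecode(it,"UTF8") into utf8Text
  -- hex dump of the UTF8 bytes, high nibble first, into h
  get binaryDecode("H*",utf8Text,h)
  -- show the UTF8 text and its hex dump in the message box
  put utf8Text & lf & h
end mouseUp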

I put this into field "field":
a 3, b 21, ü

I clicked the button and got this:
a 3, b 21, √º
6120332c20622032312c20c3bc

Broken up by characters:
61
20
33
2c
20
62
20
32
31
2c
20
c3bc

As you can see, words just look longer because some characters take two to four bytes. You can spot those characters by their bytes with the high bit set (80 and above in hex).
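
The hex dump and the per-character breakdown can be reproduced with a short Python sketch (Python, not Transcript; names are illustrative). The last line rests on an assumption: that the odd-looking "√º" in the output above is the two UTF8 bytes of ü displayed as MacRoman text.

```python
# Illustration only: reproduces the hex dump for the sample text.
text = "a 3, b 21, \u00fc"           # \u00fc is u-umlaut
data = text.encode("utf-8")

# Hex of the whole UTF-8 string, as in the handler's second output line.
whole_hex = data.hex()

# Hex of each character on its own; multibyte characters give longer strings.
per_char_hex = [ch.encode("utf-8").hex() for ch in text]

# Assumption: the message box showed the raw bytes as MacRoman text.
# Under that assumption, the bytes c3 bc display as two unrelated glyphs.
shown_as_mac_roman = data[-2:].decode("mac_roman")
```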

If you get item 3, you get the right text.

You can then convert results back to UTF16 (host order) with uniEncode and the same "UTF8" argument.
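
As a rough Python analogue of those two steps (take item 3 from the UTF8 bytes, then convert the result back to UTF16 in host byte order), here is a sketch under the assumption that an item split behaves like a byte split on the comma. The names are hypothetical.

```python
import sys

# Illustration only: Python stand-in for "item 3 of utf8Text".
utf8_text = "a 3, b 21, \u00fc".encode("utf-8")

# Items are 1-based in chunk expressions, so item 3 is index 2 here.
item3 = utf8_text.split(b",")[2]

# Pick the UTF-16 codec that matches the host byte order (no BOM is added).
host_codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# Convert the UTF-8 result back to UTF-16 in host order.
item3_utf16 = item3.decode("utf-8").encode(host_codec)
```

Note that item 3 keeps its leading space, just as it would in a chunk expression with the default item delimiter.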

Unfortunately, a BOM (byte order mark) can interfere with this: it ends up as extra bytes at the start of the first line, item, or word. Remove it when you convert to UTF8.
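
To make the BOM problem concrete, here is a small Python sketch (not Transcript). The helper name strip_utf8_bom is made up for this example. It shows the BOM riding along in the first item and a simple way to drop it.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Text that began with U+FEFF, converted to UTF-8.
with_bom = "\ufeffa 3, b 21".encode("utf-8")

# Without stripping, the first item carries three extra bytes.
first_item_raw = with_bom.split(b",")[0]

# After stripping, the first item is just the expected text.
first_item_clean = strip_utf8_bom(with_bom).split(b",")[0]
```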

This will work with characters that require surrogates in UTF16, too, so this is a nice general solution.
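
A quick check of the surrogate case, sketched in Python with U+1D11E (a musical symbol outside the primary plane) as an arbitrary example character:

```python
# Illustration only: a character that needs a surrogate pair in UTF-16.
clef = "\U0001d11e"
text = "x," + clef + ",y"

# In UTF-8 the character is four bytes, all with the high bit set.
clef_utf8_hex = clef.encode("utf-8").hex()
all_high = all(b >= 0x80 for b in clef.encode("utf-8"))

# In UTF-16 it takes two 16-bit units (a surrogate pair), so four bytes.
clef_utf16_len = len(clef.encode("utf-16-be"))

# Splitting the UTF-8 bytes on the comma still gives whole characters.
items = [part.decode("utf-8") for part in text.encode("utf-8").split(b",")]
```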

If you mostly need to work with chars, then leave the text in UTF16 and work with it two bytes at a time (each byte is one char in your script). You can very often assume that all your characters are in the primary plane (the Basic Multilingual Plane), so there are no surrogates and every two bytes is one Unicode character.
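
The two-bytes-at-a-time approach, and a way to test the "no surrogates" assumption before relying on it, might look like this in Python. The function names are hypothetical, and the byte order is passed in explicitly rather than detected.

```python
def utf16_units(data: bytes, byteorder: str = "little") -> list:
    """Split UTF-16 bytes (without a BOM) into 16-bit code units."""
    return [int.from_bytes(data[i:i + 2], byteorder)
            for i in range(0, len(data), 2)]

def has_surrogates(units: list) -> bool:
    """True if any unit is in the surrogate range D800 to DFFF."""
    return any(0xD800 <= u <= 0xDFFF for u in units)

# Primary-plane text: every two bytes is one character.
bmp_units = utf16_units("a\u00fc".encode("utf-16-le"), "little")

# A character outside the primary plane becomes a surrogate pair.
pair_units = utf16_units("\U0001d11e".encode("utf-16-le"), "little")
```

If has_surrogates returns False for your text, stepping through it two bytes at a time gives exactly one character per step.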

Dar

--
**********************************************
    DSC (Dar Scott Consulting & Dar's Lab)
    http://www.swcp.com/dsc/
    Programming and software
**********************************************

_______________________________________________
use-revolution mailing list
http://lists.runrev.com/mailman/listinfo/use-revolution
