On 20/03/2014 15:37, Geoff Canyon wrote:
> I have a field that has been populated by setting the unicodetext. Some
> lines actually need unicode -- umlauts, enye, etc. -- and others are plain
> ascii.
>
> What's the most efficient way to count how many lines are plain and how
> many actually need unicode?

Could you (once all the uni-7 stuff has settled down and we have proper conversions etc.) convert the text from unicode to UTF8, and also to an 8- or 7-bit representation such as ISO-8859-1, and then compare the number of bytes in the two representations?

If the lengths are the same in both the UTF8 and ISO-8859-1 versions, then all the characters could be represented in a single byte in UTF8.

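In fact that means all the characters are plain 7-bit ASCII, not merely ISO-8859-1: the one-byte characters in UTF8 are exactly the ASCII range (code points 0 to 127). Anything in the upper half of ISO-8859-1, such as an umlaut or an enye, takes one byte in ISO-8859-1 but two in UTF8, so a line containing one would show different lengths.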
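As a rough illustration of the byte-length comparison, here is a minimal sketch in Python rather than LiveCode (I can't vouch for the exact LiveCode conversion calls yet). The function name classify_lines and the sample text are made up for the example. It relies on Python's "replace" error handler, which substitutes a single "?" for any character the target encoding can't represent.

```python
def classify_lines(text):
    """Return (plain, needs_unicode) counts for the lines of text.

    A line counts as plain when its UTF-8 and ISO-8859-1 encodings
    have the same number of bytes.
    """
    plain = 0
    needs_unicode = 0
    for line in text.splitlines():
        utf8_len = len(line.encode("utf-8"))
        # Characters outside ISO-8859-1 become "?" (one byte each).
        latin1_len = len(line.encode("iso-8859-1", errors="replace"))
        if utf8_len == latin1_len:
            plain += 1
        else:
            needs_unicode += 1
    return plain, needs_unicode


# Hypothetical sample: two ASCII lines, one with u-umlaut, one with enye.
sample = "plain ascii\nM\u00fcnchen\nse\u00f1or\nhello"
print(classify_lines(sample))
```

Only lines made entirely of one-byte UTF8 characters pass the equal-length test, so the first count is the number of lines that need no unicode at all.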

Depending on your definition of 'plain', that may suffice. If your API actually needs plain ASCII, then as a cross-check you can convert one more time, to ASCII, and compare the actual text of the ISO-8859-1 and ASCII versions: if they differ, that should be because some characters that aren't in ASCII have been replaced with "?", so it ain't ASCII. (Unless the textDecode system is cute and e.g. tries to replace 'smart' quotes with plain ones...)
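The text comparison could look something like the following, again sketched in Python as a stand-in for whatever the LiveCode conversions turn out to be. The names roundtrip and classify are invented for the example. One assumption worth stating: comparing the ISO-8859-1 and ASCII versions only tells you anything if the line fits in ISO-8859-1 to begin with, because otherwise both versions get the same "?", so the sketch checks that first.

```python
def roundtrip(line, encoding):
    """Encode to the given encoding, replacing unrepresentable
    characters with "?", then decode back to text."""
    return line.encode(encoding, errors="replace").decode(encoding)


def classify(line):
    """Return "ascii", "latin-1" or "unicode" for a single line."""
    latin1_text = roundtrip(line, "iso-8859-1")
    if latin1_text != line:
        # Something was replaced, so the line needs more than ISO-8859-1.
        return "unicode"
    ascii_text = roundtrip(line, "ascii")
    # If the ASCII version differs, some character was replaced with "?".
    return "ascii" if ascii_text == latin1_text else "latin-1"


for line in ["what?", "caf\u00e9", "\u20ac5"]:
    print(repr(line), classify(line))
```

Note that a genuine "?" in the original text does no harm here, since it survives both conversions unchanged. Python's "replace" handler does no smart-quote substitution; whether another conversion system does is a separate question.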

Ben
