Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Ben Rubinstein
I think this problem should be solved in LC 7 (possibly using normaliseText); 
but I need a solution that I can ship now (and it's been threatened that LC 7 
will 'fix' a 'bug' which isn't, so I'm not sure if I'll ever be able to use it).


My app processes some data from - and then, re-organised, to - UTF8 text 
files. Occasionally it needs to insert a constant string; and for various 
reasons (all of them excellent) I want to specify these constant strings in 
the script.  So far, so good.  Now however one of these constant strings needs 
to contain a character which is not in ASCII.  Actually two of them.  So I 
need to express a UTF8 string in my script.  And I'm searching for an elegant 
way to do this.


My constant string used to look something like this:

   constant kMyConstantString = "This is my ice cream"

but now it needs to read something like
   constant kMyConstantString = "This ice cream is (c) Ben and Jerry's Inc"

(only with a smart apostrophe and a proper copyright symbol).

I thought I could just about manage with this

  put uniDecode(uniEncode("This ice cream is © Ben and Jerry’s Inc", "ANSI"), \
        "UTF8") into kMyConstantString


(that is, encode from ANSI to Unicode, then from Unicode into UTF8).

I tested it on Mac and it seemed to work.  The UTF8 file was generated and 
this text came out just right.



However, it turned out that when the code was compiled and run on Windows, the 
copyright symbol came out OK, but the apostrophe came out as o-tilde.


This is because uniEncode(..., "ANSI") is a lie; "ANSI" is meaningless; 
instead it interprets the source encoding as whatever is typical for the 
operating system.  I wrote the script on Mac; in MacRoman, © is 0xA9 and smart 
apostrophe is 0xD5; in ISO-8859-1 (and Unicode), 0xA9 is ©, but 0xD5 is o-tilde.
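
The byte values can be checked outside LiveCode. Below is a minimal Python sketch; Python is used only because its codec names make the byte-level behaviour easy to inspect, and treating the Windows "ANSI" set as code page 1252 is an assumption, not something stated in this thread.

```python
# The two non-ASCII bytes as a script saved on a Mac would hold them.
script_bytes = bytes([0xA9, 0xD5])

# Read as MacRoman (the Mac's native 8-bit set): what the author typed.
as_mac = script_bytes.decode("mac_roman")

# Read as ISO-8859-1, or as Windows-1252 (assumed here to be what "ANSI"
# amounts to on Windows): the apostrophe byte becomes O-tilde.
as_latin1 = script_bytes.decode("latin-1")
as_cp1252 = script_bytes.decode("cp1252")

def code_points(text):
    """List the Unicode code points of text as hex strings."""
    return [hex(ord(ch)) for ch in text]

print(code_points(as_mac))     # ['0xa9', '0x2019']
print(code_points(as_latin1))  # ['0xa9', '0xd5']
print(code_points(as_cp1252))  # ['0xa9', '0xd5']
```

Byte 0xA9 happens to mean the same character in all three sets, which is why the copyright symbol survived the move to Windows and the apostrophe did not.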


So... what's the most elegant way to do this (is there one)?  Is there any 
alternative to just looking up the UTF8 encodings and writing:


  put format("This ice cream is \xC2\xA9 Ben and Jerry\xE2\x80\x99s Inc") \
        into kMyConstantString


?
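
For reference, the hand-looked-up escapes can be verified, or generated instead of looked up. A quick Python check (Python purely for inspection, not part of the LiveCode app) that those bytes are the UTF-8 encodings of U+00A9 and U+2019:

```python
# The byte string that the format() call is intended to produce.
utf8_bytes = b"This ice cream is \xC2\xA9 Ben and Jerry\xE2\x80\x99s Inc"

# Decoding as UTF-8 must succeed and yield the copyright sign (U+00A9)
# and the right single quotation mark (U+2019).
text = utf8_bytes.decode("utf-8")

# Going the other way reproduces the escape sequences.
copyright_bytes = "\u00a9".encode("utf-8")
apostrophe_bytes = "\u2019".encode("utf-8")
print(copyright_bytes.hex(), apostrophe_bytes.hex())  # c2a9 e28099
```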

TIA,

Ben

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Mark Schonewille
Hi Ben,

The apostrophe doesn't work because you convert to ASCII text that looks 
different on different platforms. If you don't use unidecode and just set the 
unicodeText of a field to your Unicode string, it should work. If that's not 
practical, you could use macToIso() to convert your string to Latin-1.

--
Kind regards,

Mark Schonewille
Economy-x-Talk
Http://economy-x-talk.com

Share the clipboard of your computer over a local network with Clipboard Link 
http://clipboardlink.economy-x-talk.com


Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Peter Haworth
On Mon, Jun 30, 2014 at 7:38 AM, Ben Rubinstein benr...@cogapp.com wrote:

> So... what's the most elegant way to this (is there one)?  Is there any
> alternative to just looking up the UTF8 encodings and writing:
>
>   put format("This ice cream is \xC2\xA9 Ben and Jerry\xE2\x80\x99s Inc")
> into kMyConstantString


Another approach is to use the htmlText property in conjunction with HTML
entities.  Full lists of them are available on the web but apostrophe is
&apos; and the copyright symbol is &copy;

Pete
lcSQL Software http://www.lcsql.com
Home of lcStackBrowser http://www.lcsql.com/lcstackbrowser.html and
SQLiteAdmin http://www.lcsql.com/sqliteadmin.html


Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread J. Landman Gay
This is exactly what I've been dealing with for a week.  You need two steps: 
first check the platform, and if it's Windows then run macToISO on the string.  
After that your existing conversion to UTF8 should work.
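
The two steps can be sketched outside LiveCode. In this Python illustration, `mac_to_iso` is a hypothetical stand-in for macToISO, assumed to map MacRoman bytes to Windows-1252; the real function's table may differ for some characters. The platform strings "Win32" and "MacOS" are likewise assumptions about what the platform check returns.

```python
def mac_to_iso(data: bytes) -> bytes:
    """Hypothetical stand-in for macToISO: MacRoman bytes -> Windows-1252 bytes."""
    return data.decode("mac_roman").encode("cp1252")

def constant_to_utf8(script_bytes: bytes, platform: str) -> bytes:
    """Turn bytes typed into a Mac-saved script into UTF-8 on either platform."""
    if platform == "Win32":
        # Step 1: on Windows, re-map the bytes first so that decoding them
        # in the platform's native set gives the intended characters.
        native_text = mac_to_iso(script_bytes).decode("cp1252")
    else:
        native_text = script_bytes.decode("mac_roman")
    # Step 2: the existing native -> Unicode -> UTF-8 conversion.
    return native_text.encode("utf-8")

mac_bytes = b"Jerry\xd5s \xa9"
same = constant_to_utf8(mac_bytes, "Win32") == constant_to_utf8(mac_bytes, "MacOS")
print(same)  # True
```

A MacRoman character with no Windows-1252 equivalent would raise an error in this sketch, so it only covers characters present in both sets.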

-- 
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com


Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Paul Dupuis
On 6/30/2014 11:17 AM, Peter Haworth wrote:
> Another approach is to use the htmlText property in conjunction with html
> entities.  Full lists of them are available on the web but apostrophe is
> &apos; and the copyright symbol is &copy;

Just a caution that LC (depending on engine version) does not support
all HTML entity names. For example, the entity &bull; for a • is not
supported under LC 4.6.4, but is under LC 6.6.2 (and exactly what
version of LC started supporting it I haven't had the time to figure out).




Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Ben Rubinstein

Hi Mark,

Thanks for the reply.  The problem is

a) I want to do this purely in script

b) A character directly entered into the script on a Mac comes out different 
on Windows (i.e. the scripts don't know what character set they're in; they're 
simply stored with no indication of character set, and on every platform 
they're interpreted in the supposedly 'native' character set for that platform).


Presumably in 7.0 I won't even need to use normaliseText, because the scripts 
will themselves be stored in Unicode or UTF8, and therefore I can use any 
Unicode character in a real script constant.  But not in 6.x.


Ben


Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Ben Rubinstein

On 30/06/2014 16:51, Paul Dupuis wrote:
> On 6/30/2014 11:17 AM, Peter Haworth wrote:
>> Another approach is to use the htmlText property in conjunction with html
>> entities.  Full lists of them are available on the web but apostrophe is
>> &apos; and the copyright symbol is &copy;
>
> Just a caution that LC (depending on engine version) does not support
> all HTML entity names. For example, the entity &bull; for a • is not
> supported under LC 4.6.4, but is under LC 6.6.2 (and exactly what
> version of LC started supporting it I haven't had the time to figure out)

Thanks Peter, thanks Paul.

Yes, ideally my feature request here

http://quality.runrev.com/show_bug.cgi?id=1372
Bug 1372 - should be an isoToHTML or similar (or 'entities' option in
uniEncode/uniDecode)

(now in its 10th great year of being ignored!) would solve this problem. 
Without it, although we know that RunRev has tables mapping HTML entities to 
character codes, we can't access them directly in script - only indirectly 
through fields, which I can't access in this context.


Ben



Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Ben Rubinstein

On 30/06/2014 16:18, J. Landman Gay wrote:
> This is exactly what I've been dealing with for a week.  You need two steps:
> first check the platform and if it's Windows then run macToISO on the string.
> After that your existing conversion to UTF8 should work.

Aha, good tip, thank you.

On reflection though I think I'm going to adopt a modified version of Peter's 
suggestion; use HTML entities in the 'constant' string to be unambiguous but 
readable, passing it through a function called HTMLtoUTF8 so that bit of the 
script looks clean - and then do a nasty dirty implementation of that 
function, that just handles the two entities I currently care about and throws 
an error if invoked on anything else.
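
That plan can be sketched as follows, in Python rather than LiveCode. The name `html_to_utf8`, the entity table, and the choice of `&copy;` and `&rsquo;` as the two supported entities are illustrative assumptions, not the implementation that was shipped.

```python
import re

# Hypothetical table: only the two entities currently cared about.
_ENTITIES = {
    "&copy;": "\u00a9",   # copyright sign
    "&rsquo;": "\u2019",  # right single quotation mark (smart apostrophe)
}

def html_to_utf8(text: str) -> bytes:
    """Replace the known entities in an ASCII string and return UTF-8 bytes.

    Raises ValueError for any entity that is not in the table.
    """
    def substitute(match):
        entity = match.group(0)
        if entity not in _ENTITIES:
            raise ValueError("unsupported entity: " + entity)
        return _ENTITIES[entity]

    # Anything shaped like &name; or &#123; counts as an entity reference.
    return re.sub(r"&[A-Za-z0-9#]+;", substitute, text).encode("utf-8")

result = html_to_utf8("This ice cream is &copy; Ben and Jerry&rsquo;s Inc")
print(result.hex())
```

Because the source string is pure ASCII, it reads the same whichever platform's 8-bit set the script is interpreted in, which is the point of the approach.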


I'm all about the elegance, me.

thanks to all who responded,

Ben



Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Mark Schonewille
Hi Ben,

My solution will work in pre-7 and is 100% vanilla LiveCode (no idea why you 
explicitly mention again that it should be script-only). You'll have to change 
your script when you move to 7. Obviously, you could write a script for both 
versions using the do command for the 7-specific part of your script.

--
Kind regards,

Mark Schonewille
Economy-x-Talk
Http://economy-x-talk.com

Share the clipboard of your computer over a local network with Clipboard Link 
http://clipboardlink.economy-x-talk.com


Re: Elegant way to express constant UTF8 string in script?

2014-06-30 Thread Mark Schonewille
Keep in mind that HTML encoded text may not work for some higher-ASCII 
characters. That's exactly the reason why we have Unicode.

--
Kind regards,

Mark Schonewille
Economy-x-Talk
Http://economy-x-talk.com

Share the clipboard of your computer over a local network with Clipboard Link 
http://clipboardlink.economy-x-talk.com

