On 2018-06-01 02:14, Tim Selander via use-livecode wrote:
Hi Kee and Alex,

The original documents I'm working with are UTF8, so that's what I've
been using. So converting them to UTF16 is recommended? I'll try that.

Alex, desktop is version 8 something, and the server is the one
installed on the on-rev host; can't remember what the key in $_Server
for that info is, and Googling failed me this time...

You should be fine using 'character' on any Unicode text - it uses the Unicode grapheme breaking rules to find the boundaries ('grapheme' being the specific name for a 'character' as humans think of one).

That being said, I think codepoint (from memory) should also be okay on Japanese text as I don't think the Japanese/Chinese scripts have any multi-codepoint characters - they just use codepoints with value > 65535 for less used ideographs (the 'supplementary plane'). [ Korean script can be encoded with Hangul jamo, which *does* require the use of character, as a single Hangul syllable can be composed of up to three codepoints ].
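
To make the character/codepoint distinction concrete, here is a minimal sketch. The handler name compareChunkCounts and the sample values are made up for illustration: it builds one Hangul syllable from three jamo and one supplementary-plane ideograph, then counts each by chunk type.

```livecode
command compareChunkCounts
   local tSyllable, tIdeograph
   -- One Hangul syllable built from three conjoining jamo
   -- (decimal values of U+1100, U+1161 and U+11A8).
   put numToCodepoint(4352) & numToCodepoint(4449) & numToCodepoint(4520) into tSyllable
   -- One ideograph from the supplementary plane (decimal value of U+2000B).
   put numToCodepoint(131083) into tIdeograph

   put "syllable codepoints:" && the number of codepoints in tSyllable & return & \
         "syllable characters:" && the number of characters in tSyllable & return & \
         "ideograph codeunits:" && the number of codeunits in tIdeograph & return & \
         "ideograph codepoints:" && the number of codepoints in tIdeograph
end compareChunkCounts
```

Under the grapheme rules the syllable should count as a single character made of three codepoints, while the ideograph should be one codepoint stored as two UTF-16 codeunits.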

The fact it is breaking on Japanese text in the way you suggest makes me think you aren't textDecode()'ing your UTF-8 input files:

e.g.
   put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText

Without decoding as utf-8, the engine will think your file is 'native' (single-byte encoded), so each byte of the file will be seen as a separate character.
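
A quick way to check whether this is what is happening is to compare counts before and after decoding. This is only a sketch: compareDecoding and pPath are hypothetical names, and it assumes the file really is UTF-8.

```livecode
command compareDecoding pPath
   local tRaw, tText
   -- Raw bytes exactly as stored on disk; nothing is decoded here.
   put url ("binfile:" & pPath) into tRaw
   -- Interpret those bytes as UTF-8 to get a proper text string.
   put textDecode(tRaw, "utf-8") into tText

   put "bytes in file:" && the number of bytes in tRaw & return & \
         "chars if left undecoded:" && the number of chars in tRaw & return & \
         "chars after textDecode:" && the number of chars in tText
end compareDecoding
```

For a file containing Japanese text, the undecoded character count should match the byte count, and the decoded count should be noticeably smaller.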

Internally the engine uses either single-byte or double-byte encodings for strings (the latter being UTF-16). This is not user-visible; you just need to make sure that incoming data is decoded correctly.

Can you share the code you are using to read in the text data, and the code which is breaking on the server?

Warmest Regards,

Mark.

P.S. 'word' in LC is still any sequence of non-space characters separated by spaces, or any sequence of characters delimited by quotes - it takes no account of the script of the text, nor actual word-boundaries. If you want human-style word boundaries then you should use trueWord (which uses the standard Unicode word breaking rules).
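
As a rough illustration of the difference, the following sketch lists the chunks that each type finds in the same string. The handler name compareWordChunks and the sample text are made up for illustration.

```livecode
command compareWordChunks
   local tText, tChunk, tWords, tTrueWords
   put "one,two three" into tText

   -- 'word' splits on spaces only, so punctuation stays attached.
   repeat for each word tChunk in tText
      put "[" & tChunk & "]" after tWords
   end repeat

   -- 'trueWord' follows the Unicode word breaking rules.
   repeat for each trueWord tChunk in tText
      put "[" & tChunk & "]" after tTrueWords
   end repeat

   put "word:" && tWords & return & "trueWord:" && tTrueWords
end compareWordChunks
```

Here 'word' should keep "one,two" together as a single chunk, whereas trueWord should separate it at the comma.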

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

