Hi Mark,

Here is the script. It lives at bamboobabies.com/getjapanesetext.lc, and the text it reads is at bamboobabies.com/news.txt.

In the script, there are two lines that read the text file. I have tried each in turn, commenting out the other.

If you can give me any hints, it would be greatly appreciated.

Tim Selander


<?lc put header "Content-Type: text/html; charset=UTF-8" ?>
<!DOCTYPE HTML>
<html>
    <head>
<meta http-equiv="Content-type" content="text/html; charset=UTF8">
        <title>workbench</title>
    </head>
<body>

<?lc
--This line loads readable japanese text, but putting char 500 to 550 breaks beginning and ending kanji
put url "http://bamboobabies.com/news.txt" into vText

--When this line is used, none of the put text is readable
--put textDecode(url "binfile:bamboobabies.com/news.txt", "utf-8") into vText

put line 1 of vText

put "<BR><BR><BR><BR>"

put char 500 to 550 of vText
 ?>
</body>
</html>




On 2018.06.01 16:17, Mark Waddingham via use-livecode wrote:

You should be fine using 'character' on any Unicode text - it
uses the Unicode grapheme (the specific name for 'characters' as
humans think of them) breaking rules to find the
boundaries.

That being said, I think codepoint (from memory) should also be
okay on Japanese text as I don't think the Japanese/Chinese
scripts have any multi-codepoint characters - they just use
codepoints with value > 65535 for less used ideographs (the
'supplementary plane'). [ Korean script can be encoded with
conjoining Hangul jamo, which *does* require the use of character,
as a single Hangul syllable can be composed of up to three
codepoints ].
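
As a rough illustration of the codepoint/character distinction described above, here is a minimal sketch in Python rather than LiveCode. The sample characters are chosen only for illustration, and Python's len() counts codepoints, so it plays the role of 'codepoint' here, not of LiveCode's 'character'.

```python
import unicodedata

# One Hangul syllable as a single precomposed codepoint (U+D55C).
precomposed = "\uD55C"
# The same syllable decomposed into conjoining jamo (NFD form).
decomposed = unicodedata.normalize("NFD", precomposed)

# Codepoint counts for the two forms of what a reader sees as
# one character.
codepoint_counts = (len(precomposed), len(decomposed))

# An ideograph outside the Basic Multilingual Plane (value > 65535):
# still one codepoint, but two 16-bit code units in UTF-16.
rare_ideograph = "\U00020BB7"
utf16_units = len(rare_ideograph.encode("utf-16-le")) // 2

print(codepoint_counts)
print(len(rare_ideograph), utf16_units)
```

The decomposed syllable is the case where counting by codepoint would split something a reader sees as one character.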

The fact it is breaking on Japanese text in the way you suggest
makes me think you aren't textDecode()'ing your UTF-8 input files:

e.g.
    put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText

Without decoding as utf-8, the engine will think your file is
'native' (single-byte encoded), so each byte of the file will be
seen as a separate character.
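
The effect can be sketched outside LiveCode. The following Python snippet is only an illustration under stated assumptions: the sample string is made up, and Latin-1 is used as a stand-in for a single-byte 'native' encoding (the actual native encoding depends on the platform).

```python
# Hypothetical sample text: 8 Japanese characters.
sample = "日本語のテキスト"

# What a UTF-8 file on disk (or over HTTP) actually contains.
raw = sample.encode("utf-8")

# Undecoded: every byte is treated as its own character.
# Latin-1 is an assumed stand-in for a single-byte encoding.
as_single_byte = raw.decode("latin-1")

# Decoded: multi-byte sequences are combined into characters.
as_utf8 = raw.decode("utf-8")

print(len(as_single_byte))  # number of bytes
print(len(as_utf8))         # number of characters

# Taking a range of "characters" from the undecoded data is really
# taking a range of bytes, which can cut a kanji in half.
broken = raw[1:5].decode("utf-8", errors="replace")
print("\ufffd" in broken)   # replacement characters mark the damage
```

This matches the symptom of a chunk such as char 500 to 550 having damaged first and last kanji: the range boundaries fall in the middle of multi-byte sequences.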

Internally the engine uses either single-byte or double-byte
encodings for strings (the latter being UTF-16). This is not
user-visible; you just need to make sure that incoming data is
decoded correctly.

Can you share the code you are using to read in the text data and
the code which is breaking on the server?

Warmest Regards,

Mark.

P.S. 'word' in LC is still any sequence of non-space characters
separated by spaces, or any sequence of characters delimited by
quotes - it takes no account of the script of the text, nor
actual word-boundaries. If you want human-style word boundaries
then you should use trueWord (which uses the standard Unicode
word breaking rules).


