Here is the script. The files I'm using are
bamboobabies.com/getjapanesetext.lc, and the text it is getting
In the script, there are two lines reading the text file that
I've taken turns commenting out....
If you can give me any hints, it would be greatly appreciated.
<?lc put header "Content-Type: text/html; charset=UTF-8" ?>
<meta http-equiv="Content-type" content="text/html;
--This line loads readable Japanese text, but putting char 500 to 550 breaks beginning and ending kanji
put url "http://bamboobabies.com/news.txt" into vText
--When this line is used, none of the put text is readable
--put textDecode(url "binfile:bamboobabies.com/news.txt", "utf-8") into vText
put line 1 of vText
put char 500 to 550 of vText
On 2018.06.01 16:17, Mark Waddingham via use-livecode wrote:
You should be fine using 'character' on any Unicode text - it
uses the Unicode grapheme (the specific name for 'characters' as
humans 'think' of characters) breaking rules to find the
That being said, I think codepoint (from memory) should also be
okay on Japanese text as I don't think the Japanese/Chinese
scripts have any multi-codepoint characters - they just use
codepoints with value > 65535 for less used ideographs (the
'supplementary plane'). [ Korean script can be encoded with
Hangul jamo, which *does* require the use of character, as a single
Korean Hangul syllable can be composed of up to three codepoints ].
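To make the codepoint/character distinction concrete, here is a minimal sketch that counts both chunk types. The handler name demoChunkCounts and the sample codepoints are my own choices for illustration, not taken from the original script. Decimal values are passed to numToCodepoint to keep the script source plain ASCII.

```livecode
-- Sketch only: demoChunkCounts is a hypothetical name.
function demoChunkCounts
   local tHangul, tRare, tReport

   -- One Hangul syllable spelled with three conjoining jamo:
   -- U+1112 (4370), U+1161 (4449), U+11AB (4523)
   put numToCodepoint(4370) & numToCodepoint(4449) & numToCodepoint(4523) into tHangul
   put "Hangul codepoints:" && (the number of codepoints in tHangul) & return after tReport
   put "Hangul characters:" && (the number of characters in tHangul) & return after tReport

   -- One supplementary-plane ideograph, U+2000B (131083):
   -- a single codepoint that needs two UTF-16 code units
   put numToCodepoint(131083) into tRare
   put "Rare codepoints:" && (the number of codepoints in tRare) & return after tReport
   put "Rare codeunits:" && (the number of codeunits in tRare) & return after tReport

   return tReport
end demoChunkCounts
```

If the engine applies the grapheme rules as described, the Hangul sequence should come back as three codepoints but a single character, which is why 'character' is the safer chunk for slicing text meant for display.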
The fact it is breaking on Japanese text in the way you suggest
makes me think you aren't textDecode()'ing your UTF-8 input files:
put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText
Without decoding as utf-8, the engine will think your file is
'native' (single-byte encoded), so each byte of the file will be
seen as a separate character.
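The effect can be reproduced without any file at all. The sketch below is an illustration only (demoDecoding is a hypothetical name): it builds a three-kanji string, encodes it to UTF-8 to stand in for the raw file contents, then counts chars before and after textDecode.

```livecode
-- Sketch only: demoDecoding is a hypothetical name.
function demoDecoding
   local tText, tRaw, tReport

   -- Three kanji: U+65E5 (26085), U+672C (26412), U+8A9E (35486)
   put numToCodepoint(26085) & numToCodepoint(26412) & numToCodepoint(35486) into tText

   -- Stand-in for what url ("binfile:...") would return: raw UTF-8 bytes.
   -- Each of these kanji takes three bytes in UTF-8, so tRaw holds nine.
   put textEncode(tText, "UTF-8") into tRaw
   put "bytes in raw data:" && (the number of bytes in tRaw) & return after tReport

   -- Left undecoded, the data is read as native text, one char per byte
   put "chars when undecoded:" && (the number of chars in tRaw) & return after tReport

   -- Decoded, each kanji is one char again
   put "chars when decoded:" && (the number of chars in textDecode(tRaw, "UTF-8")) & return after tReport

   return tReport
end demoDecoding
```

With undecoded data, a range such as char 500 to 550 counts bytes rather than kanji, so it can start or stop partway through a three-byte sequence. That would match the broken kanji at both ends of the range.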
Internally the engine uses either single-byte or double-byte
encodings for strings (the latter being UTF-16) - which is not
user-visible, you just need to make sure that incoming data is
Can you share the code you are using to read in the text data and
code which is breaking on server?
P.S. 'word' in LC is still any sequence of non-space characters
separated by spaces, or any sequence of characters delimited by
quotes - it takes no account of the script of the text, nor
actual word-boundaries. If you want human-style word boundaries
then you should use trueWord (which uses the standard Unicode
word breaking rules).
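A small sketch of the difference between the two chunk types, using a made-up sample sentence (demoWordChunks and the sentence are hypothetical, chosen only for illustration):

```livecode
-- Sketch only: demoWordChunks is a hypothetical name.
function demoWordChunks
   local tText, tReport

   -- Builds the text: the "quick brown" fox, isn't it?
   put "the" && quote & "quick brown" & quote && "fox, isn't it?" into tText

   -- 'word' treats the quoted phrase as one chunk, quotes included
   put "word 2:" && (word 2 of tText) & return after tReport
   put "words:" && (the number of words in tText) & return after tReport

   -- 'trueWord' applies the Unicode word-breaking rules instead
   put "trueWord 2:" && (trueWord 2 of tText) & return after tReport
   put "trueWords:" && (the number of trueWords in tText) & return after tReport

   return tReport
end demoWordChunks
```

Comparing the two halves of the report shows where quoting and punctuation change the result. For text without spaces between words, such as Japanese, only trueWord has any chance of finding boundaries.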