As we know, with LC it is pretty straightforward to deal with internationalised 
text for remote databases and unknown user platforms by converting to UTF-8. 
But I have come across a problem with Linux filenames containing non-ASCII 
characters which has me befuddled.

My many-years-old app has until now required all filenames to be in standard 
7-bit ASCII, so it was way past time I brought it up to date.

The app talks to a database, media and web site on a unix (DreamHost) server, 
using LC Server as intermediary.

I create a file, say “Carré.txt”, on a Mac - the non-ASCII character in that 
name being [e-acute]. I shall use this bracket convention from now on, to 
ensure that what is displayed here on the forum is understood.

BTW, as far as I can determine, that character in the Mac file system is a 
single byte, hex [8e] - the classic MacRoman encoding - not its 2-byte UTF-8 
encoding [C3A9]. So I don’t understand how macOS handles Unicode in its 
filesystem, which it certainly does. We are exhorted to textEncode to UTF-8 
when exporting anything outside LC, but perhaps not filenames?? If I textEncode 
the filename and save under that name, I get a new file 
“Carr[squareroot][copyright].txt”. I am befuddled already - how does macOS 
distinguish MacRoman encoding from Unicode encoding when it displays a file 
name? - but that is another story for another place.

Oh, and another story: it ain't true that all text in LC is UTF-16. While it’s 
not possible using the LC APIs to determine exactly what is inside the 
black box of an LC variable in memory, it is evidently platform-dependent - 
that MacRoman [8e] is reported as being the relevant byte in the LC variable 
on a Mac. What can be determined is what is on disk when a stack is saved: 
there, text appears to be encoded as a mixture of 7-bit ASCII where possible 
and UTF-16 for other characters. Not that we as consumers need to know how the 
magic is performed, as long as it works. Back to my story..
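To illustrate that platform dependence (a sketch only - what LC actually stores internally is, as noted, a black box): the same character has a different "native" single-byte value on each platform, quite apart from its UTF-8 and UTF-16 forms.

```python
ch = "é"  # [e-acute]
mac    = ch.encode("mac_roman")  # b'\x8e'     - Mac native byte
linux  = ch.encode("latin-1")    # b'\xe9'     - ISO Latin-1 native byte
utf8   = ch.encode("utf-8")      # b'\xc3\xa9' - 2-byte UTF-8
utf16  = ch.encode("utf-16-le")  # b'\xe9\x00' - 2-byte UTF-16 (little-endian)
print(mac, linux, utf8, utf16)
```

So "what byte is in the variable" has no single answer until you pin down the encoding.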

So now I want to upload this file to my remote Linux server. I POST a form, 
prepared with libURLMultiPartFormData, to an LC Server script, which is 
supposed to save the received file.
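I don't know what libURLMultiPartFormData actually puts on the wire, but a minimal hand-rolled multipart body (sketched in Python, with a made-up boundary and field name) shows the crux: the filename travels as raw bytes inside the Content-Disposition header, so whatever encoding the client picks is exactly what the server-side script receives.

```python
def multipart_file_part(field, filename, data, boundary=b"XXBOUNDARY"):
    """Build one file part of a multipart/form-data body (sketch only)."""
    name_bytes = filename.encode("utf-8")  # the client's choice of encoding
    return (b"--" + boundary + b"\r\n"
            b'Content-Disposition: form-data; name="' + field + b'"; '
            b'filename="' + name_bytes + b'"\r\n'
            b"Content-Type: application/octet-stream\r\n\r\n"
            + data + b"\r\n--" + boundary + b"--\r\n")

part = multipart_file_part(b"upload", "Carré.txt", b"hello")
assert b"Carr\xc3\xa9.txt" in part  # the UTF-8 bytes go over the wire as-is
```

Whether the server then treats those bytes as UTF-8, Latin-1 or something else is the whole question.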

If I attempt to use the original Mac file name, the server responds “Cannot 
open file Carr[e-acute].txt” (this is the result error message from 
"open file tFileName for binary write").

If I send textEncode(filename, "UTF-8") as the file name, the server responds 
“Cannot open file Carr[squareroot][copyright].txt”.

If I textEncode at the client end and then textDecode on the server, it 
responds “Cannot open file Carre[E-grave].txt”. (Where did THAT come from? Is 
there a bug in textDecode on Linux LC Server? The native encoding on Linux is 
supposed to be ISO Latin-1, where [E-grave] is hex [C8]; in MacRoman it is 
[E9]; no apparent connection between them or the UTF-8 bytes.)
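One guess - purely an assumption on my part, and assuming the name shown is really Carr[E-grave].txt: the round trip may in fact have produced the right character, and only the error message is mojibake. If the server-side name holds [e-acute] as the Latin-1 byte [E9], and that error string is then displayed back on a Mac as MacRoman, [E9] shows as [E-grave]:

```python
# [e-acute] as a Latin-1 byte, re-displayed under MacRoman, comes out as E-grave
latin1_byte = "é".encode("latin-1")      # b'\xe9'
shown = latin1_byte.decode("mac_roman")  # MacRoman 0xE9 is [E-grave]
print(shown)  # È
```

If that is what is happening, the [C8]/[E9] values really are connected after all - just through two different codecs.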

And just as a piece of nonsense: if I send the raw un-encoded Mac file name 
but then textDecode on the server, the file is happily saved as “Carr.txt”. 
That is correct behaviour of a sort, since [8e] is not legal as the start of 
a UTF-8 sequence, so the [e-acute] is just skipped by textDecode.
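A quick Python sketch of that lenient-decode behaviour (errors="ignore" standing in for whatever textDecode does with invalid input - an assumption on my part):

```python
raw = b"Carr\x8e.txt"  # MacRoman bytes of "Carr[e-acute].txt"
# 0x8E is a UTF-8 continuation byte and can never begin a sequence,
# so a lenient decoder silently drops it:
assert raw.decode("utf-8", errors="ignore") == "Carr.txt"
```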

Could it be that LC Server cannot create files on Linux with non-ASCII 
names?!? That doesn’t seem believable. I can of course directly create files 
on the server with non-ASCII names containing characters such as [e-acute].
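For the record, the Linux filesystem itself has no encoding opinion at all: a filename is any byte string not containing NUL or "/". So even the raw MacRoman byte is a legal (if unreadably named) file; a Python sketch run on Linux:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # a filename containing the raw MacRoman byte 0x8E - legal on ext4
    path = os.path.join(d.encode(), b"Carr\x8e.txt")
    with open(path, "wb") as f:
        f.write(b"hello")
    names = os.listdir(d.encode())

assert b"Carr\x8e.txt" in names
```

Which suggests the failure is happening in how the name gets from the request into the "open file" call, not in the filesystem.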

Either I am missing something, or surely our European users have seen this 
already, so someone should be able to unfuddle me!

Neville Smythe



_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
