Re: Guessing the encoding of a test file...

Paul Dupuis via use-livecode Thu, 19 Mar 2020 16:48:24 -0700

Users of our application may use text files in whatever encoding their local system creates them in. We cannot tell them to create such files only with a specific encoding. So, we need to detect the encoding of the text file the user selects.

As I mentioned, I have an LC script that implements an encoding-guessing algorithm. I am looking for an alternative or a better one, if someone out there happens to have created one they might like to share or license.

Any such routine needs to work on macOS and Windows and return the types used by the LC textDecode function.


I already knew about file on OSX, but I need a cross-platform solution.


On 3/19/2020 6:15 PM, Pi Digital via use-livecode wrote:

On a mac it’s easy. Use
file -I "MyFile.txt"
  as a shell command.

On Windows it’s near impossible without running a whole bunch of arbitrary 
tests that may or may not be correct - and certainly not accurate.

What kind of text were you hoping to see? Were you looking for a particular 
encoding? If it is grammatical text, there are a bunch of checks you can run to 
see what character sets are used, but even then it’s only a 
‘probably’/‘possibly’ response.
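
As a rough illustration of that kind of check (sketched in Python rather than LiveCode, with a hypothetical function name), one cheap test is whether the bytes are pure ASCII or valid UTF-8. Anything else stays undecided, which is where the ‘probably’/‘possibly’ part comes in.

```python
from typing import Optional


def guess_from_bytes(data: bytes) -> Optional[str]:
    """Return "ASCII", "UTF-8", or None when the bytes are ambiguous.

    Hypothetical helper for illustration only; it does not look at
    byte-order marks and does not try to tell 8-bit code pages apart.
    """
    # NUL bytes usually mean UTF-16/UTF-32 without a BOM; not handled here.
    if b"\x00" in data:
        return None
    # Every byte below 0x80 means the data is plain 7-bit ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # Text in an 8-bit code page rarely happens to be valid UTF-8,
    # so a strict decode that succeeds is a strong hint.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None  # could be CP1252, MacRoman, or something else
    return "UTF-8"
```

A None result is the interesting case: the bytes are valid in several 8-bit encodings at once, so only statistics about the text itself can rank the candidates.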

Sean Cole
Pi Digital

On 19 Mar 2020, at 20:31, Paul Dupuis via use-livecode 
<use-livecode@lists.runrev.com> wrote:

This has come up many times before, but I'll ask once again in case something 
has changed or someone new sees this.


Does anyone have a routine that will take a filespec to a text file and return 
the guessed encoding of the text file?


First, please don't respond with "you should know the encoding" or "the users 
should know the encoding of their files". That is not possible in the largely 
uncontrolled real world.

I do already have a routine to guess file encodings. It was written by someone 
else. There are instances where it should work and does not. I fear there may 
be errors in the algorithm and I do not have the original algorithm to check it 
against. Hence, I am looking for an alternative that is either free to use or 
to be licensed for a modest fee.

My current routine attempts to return the encoding as a string that can be 
directly passed to textDecode(binaryData,encoding)

"ASCII"
"UTF-16"
"UTF-16BE"
"UTF-16LE"
"UTF-32"
"UTF-32BE"
"UTF-32LE"
"UTF-8"
"CP1252" *
"MacRoman" *

* for these last 2, if the file is MacRoman on a Windows system, you actually have to 
textDecode(macToISO(data),"CP1252") and if you have CP1252 on the Mac, you need to do 
textDecode(isoToMac(data),"MacRoman"). There is an enhancement request to support 
MacRoman decoding under Windows and vice versa at 
https://quality.livecode.com/show_bug.cgi?id=22391 if you want to CC yourself to show interest.
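
A first step toward returning one of the names above is a byte-order-mark check. The following is only a sketch in Python, not the routine described in this message; sniff_bom is a hypothetical name, and whether textDecode strips the BOM itself is not covered here.

```python
import codecs
from typing import Optional

# Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so the 4-byte marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: the file may still be ASCII, UTF-8, CP1252, MacRoman, ...
    return None
```

Files without a BOM fall through to None, so this settles only the easy cases; the CP1252 versus MacRoman question still needs a content-based guess.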


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
