On 2020-03-20 15:34, Paul Dupuis via use-livecode wrote:
Why did I ask this? Because I am interested in comparing the accuracy
of our current handler to any other that may be available. Users
being users, we recently had a user reveal a bug (a misnamed variable)
in our current function that meant it was missing certain edge cases
(and this user has hundreds of text files that need this edge case to
be properly recognized as MacRoman encoding). So that bug has been
fixed, but I am still interested in comparing any other guessEncoding
routines to our current one to see if we can do better than we
currently are.

Perhaps:

https://pypi.org/project/chardet/

It sounds like it uses a statistical (perhaps even an ML) model to detect charsets, similar to Mozilla's 'UCD' (as mentioned by someone else in this thread).

As always, thanks for reading and responding, Mark. We're actually doing
what you suggest. We had a set of QA test cases (text files in many
different line endings and encodings), some intended to fail (such as
Windows code pages we don't support). We're expanding these and doing
a review on macOS and Windows with our app. Ones that fail, that we
think shouldn't fail, we will step through the code to see why they
fail and if our algorithm can be further enhanced. I can't foresee any
algorithm tweaks we can't code ourselves that we'd need LC or USE-LIST
assistance for.

My main reason for asking was to see whether it seemed a reasonable assumption (to me, at least) that any algorithm could determine the character encoding correctly. For example, MacRoman and Windows-1252 are very similar, so telling them apart would come with a reasonably high rate of error.

Back around LiveCode 7, Fraser said, in response to some
correspondence I had with him, that he would consider creating a
"guessEncoding" to go along with the Unicode Everywhere work and the
new textEncode/textDecode functions. I do understand the reluctance,
as a business, to do so, as inevitably there will be some instances
where it guesses wrong.

I can't recall exactly, but I think Fraser was thinking along the lines of being able to tell the difference between the utf-8, utf-16, utf-32 and native encodings. That can be done with a high degree of confidence, and indeed is straightforward enough to code in LiveCode Script (e.g. you can be almost 100% sure something is utf-8 if it roundtrips identically).
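
To make the round-trip idea concrete, here is a minimal sketch in Python rather than LiveCode Script (purely for illustration). The function name `guess_unicode_encoding` and the "native" fallback label are hypothetical, and the byte-order-mark check is an extra assumption on my part, not something described above.

```python
import codecs

# Byte-order marks, with UTF-32 listed first because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_unicode_encoding(data: bytes, fallback: str = "native") -> str:
    """Hypothetical helper: guess utf-8 / utf-16 / utf-32, else the fallback."""
    # 1. An explicit byte-order mark is the strongest hint.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Strictly decode as UTF-8 and check that re-encoding reproduces
    #    the input bytes exactly (the round trip).
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    if text.encode("utf-8") == data:
        # Note: pure ASCII data also lands here, since ASCII is valid UTF-8.
        return "utf-8"
    return fallback
```

Without a byte-order mark this sketch does not attempt to recognise utf-16 or utf-32; that would need further checks (such as the pattern of zero bytes), which are left out here.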

As I'm sure you are acutely aware, the difficult problem is telling the difference between very dense shift-sequence encodings (those which don't have some redundancy in their encodings to help with validation), and between single-char encodings (e.g. between MacRoman and Latin-1). There is no algorithm for that per se, just lots of heuristics (based on statistical models) and possibly dictionary lookups to help distinguish edge cases. Implementing something such as that is no small endeavour...
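
As a rough illustration of what such a heuristic looks like (again in Python, and far cruder than a real statistical model), the sketch below scores MacRoman against Windows-1252. The function name `guess_single_byte` and the scoring rule are my own assumptions: it counts how many non-ASCII characters decode to lowercase letters, on the guess that accented letters in Western European text are mostly lowercase.

```python
def guess_single_byte(data: bytes) -> str:
    """Hypothetical heuristic: choose between MacRoman and Windows-1252."""
    scores = {}
    for name in ("mac_roman", "cp1252"):
        try:
            # Python's cp1252 codec leaves a few bytes undefined (0x81,
            # 0x8D, 0x8F, 0x90, 0x9D), so a strict decode can rule it out.
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        # Assumed scoring rule: count non-ASCII characters that decode
        # to lowercase letters.
        scores[name] = sum(1 for ch in text if ord(ch) > 127 and ch.islower())
    if not scores:
        return "unknown"
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    # A tie (e.g. pure ASCII input) means the bytes carry no evidence.
    return winners[0] if len(winners) == 1 else "ambiguous"


print(guess_single_byte("café".encode("mac_roman")))
print(guess_single_byte("café".encode("cp1252")))
```

This works on short examples only because the two encodings happen to place their lowercase accented letters in different byte ranges; it will guess wrong on text dominated by symbols, currency signs or uppercase accents, which is exactly the kind of error rate discussed above.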

I am under the, perhaps false, impression that isoToMac and macToIso
are sort of viewed as functions that may become deprecated and no
longer updated in the future. However, they are still essential for us
until I can textDecode(someData,"MacRoman") on a Windows system and
vice versa.

They've not been deprecated yet so they aren't going anywhere - the internal functions those wrap are actually used to charset-swap strings in pre-v7 binary stackfiles (from v7, strings are serialized as utf-8 in stackfiles).

We probably will deprecate them when we make textDecode/Encode accept more encodings (as suggested in the enhancement request), but only because the latter is a much neater way to do things... I believe the code you use at the moment gives results identical to what native textDecode/Encode support would give, doesn't it?

Warmest Regards,

Mark

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
