Re: Guessing the encoding of a test file...

2020-03-25 Thread Ben Rubinstein via use-livecode


On 19/03/2020 20:31, Paul Dupuis via use-livecode wrote:
There is an enhancement request to support MacRoman decoding under Windows and 
vice versa at https://quality.livecode.com/show_bug.cgi?id=22391 if you want 
to CC yourself to show interest.


See also
https://quality.livecode.com/show_bug.cgi?id=12205

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: Guessing the encoding of a test file...

2020-03-22 Thread Paul Dupuis via use-livecode


On 3/22/2020 8:41 AM, Mark Waddingham via use-livecode wrote:

On 2020-03-21 14:09, Paul Dupuis via use-livecode wrote:

So far the only person who has read my post and replied with what I
was looking for was Peter - and although the routine was written in
Rebol rather than LiveCode, he kindly provided a link to information
about it.


It might have got lost in amongst other replies but I did suggest:




Thanks, Mark.

I apologize. I did miss the reference to Chardet. At one point we looked 
at wrapping a C++ interface to the Mozilla code, but we don't have 
anyone here who has had the time to learn LCB and FFI. I know! We should 
make the time!





It even comes with a command-line script (chardetect) which would 
allow you to compare your detector with that one.


However, on further digging it appears that this does not (as it 
stands) detect MacRoman which is obviously a key requirement here.


MacRoman detection is an essential requirement. One of our selling 
points, since many universities these days have people on mixed 
platforms, is that our tool is nearly identical across macOS and Windows 
to facilitate researcher collaboration. We do have people sending 
files created on their Macs to Windows team members and vice versa, so 
we have to detect MacRoman and CP1252 on both platforms.




There is a stale PR for that though 
 so the method used here is 
obviously possible to extend to that.


From what I have read the Python one is a python reimplementation of 
Mozilla's 'Universal Charset Detector' which, from what I have read, 
is/was pretty much state of the art - reading through the chardet docs 
(https://chardet.readthedocs.io/en/latest/how-it-works.html#single-byte-encodings 
is perhaps the most pertinent) it sounds like its single-byte 
detectors use 2-byte sequences to try and distinguish.


There is a special case for Latin-1 (1252) which is needed because 
English text looks the same in a large number of encodings - this 
works by looking for curly quotes and other special symbols by the 
look of it. (The MacRoman addition in the stale PR above, is also a 
Latin-1 like special-case - which makes sense as Latin-1 and MacRoman 
are almost just permutations of each other).


My general feeling is that if you already have a process which works 
to detect the differences between MacRoman and Latin-1, then it is 
likely largely equivalent to any other means which exists (the 
accepted answer here 
 
sounds like it pretty much sums up the situation!) so beyond fixing 
the bug(s) you found recently, you might find that there is nothing 
more you can do.


And we arrive at the same place! Our review of our code, which failed to 
handle a particular MacRoman detection, and a comparison with other encoding 
guessing algorithms turned up a couple of issues, all fixable in our 
code, and only one was an encoding guessing issue.


In our guessEncoding routine, there was a misspelled variable that was 
preventing the detection of MacRoman from line ending comparisons from 
working properly. I'm not sure how this got past our QA, but, as you 
know, sometimes things do and it did. With that fixed, we are getting 
accurate detection of CP1252, MacRoman, ASCII, UTF8, UTF16 BE/LE, and 
UTF32 BE/LE on our suite of about 30 different test files.


We also ran into an edge case where a UTF8 or UTF16 file with Mac CR 
(ASCII 13) line endings needed an adjustment to convert the line endings 
to linefeeds.
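
For readers who want to experiment with the line-ending part outside 
LiveCode, here is a minimal Python sketch. It is not the guessEncoding code 
discussed above; the names line_ending_counts and normalize_newlines are 
made up for this illustration, and the byte counting only makes sense for 
single-byte or UTF8 data (UTF16/UTF32 data should be decoded first).

```python
def line_ending_counts(data: bytes) -> dict:
    """Count CRLF, bare CR and bare LF line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "cr": data.count(b"\r") - crlf,   # CR not followed by LF
        "lf": data.count(b"\n") - crlf,   # LF not preceded by CR
    }


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR to LF in already-decoded text."""
    # Replace CRLF first so it does not turn into two linefeeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(line_ending_counts(b"one\rtwo\rthree"))
print(repr(normalize_newlines("one\rtwo\r\nthree\n")))
```

A file with only bare CR endings is a hint (not proof) of a classic Mac 
origin, which can be one input to a MacRoman guess. Normalizing after 
decoding avoids having to reason about 2-byte CR units in UTF16.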


So at this point our code is detecting the encoding for and reading text 
files into LC with a pretty high rate of accuracy.


For anyone else needing such code, I will try to pull it into a single 
library and somehow make it available. All I will ask is that if anyone 
does use it and improve upon it, they share the improvement back.




Re: Guessing the encoding of a test file...

2020-03-22 Thread Mark Waddingham via use-livecode


On 2020-03-21 14:09, Paul Dupuis via use-livecode wrote:

So far the only person who has read my post and replied with what I
was looking for was Peter - and although the routine was written in
Rebol rather than LiveCode, he kindly provided a link to information
about it.


It might have got lost in amongst other replies but I did suggest:



It even comes with a command-line script (chardetect) which would allow 
you to compare your detector with that one.


However, on further digging it appears that this does not (as it stands) 
detect MacRoman which is obviously a key requirement here.


There is a stale PR for that though 
 so the method used here is 
obviously possible to extend to that.


From what I have read the Python one is a python reimplementation of 
Mozilla's 'Universal Charset Detector' which, from what I have read, 
is/was pretty much state of the art - reading through the chardet docs 
(https://chardet.readthedocs.io/en/latest/how-it-works.html#single-byte-encodings 
is perhaps the most pertinent) it sounds like its single-byte detectors 
use 2-byte sequences to try and distinguish.


There is a special case for Latin-1 (1252) which is needed because 
English text looks the same in a large number of encodings - this works 
by looking for curly quotes and other special symbols by the look of it. 
(The MacRoman addition in the stale PR above, is also a Latin-1 like 
special-case - which makes sense as Latin-1 and MacRoman are almost just 
permutations of each other).
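
To make the "curly quotes and other special symbols" idea concrete, here is 
a minimal Python sketch. It is not chardet's code and not the code from the 
stale PR; the function name guess_single_byte and the choice of seven 
punctuation marks are assumptions made for this illustration. It relies only 
on Python's built-in cp1252 and mac_roman codecs.

```python
# Typographic marks that exist in both code pages at different byte values:
# curly single/double quotes, en dash, em dash, ellipsis.
MARKS = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"
CP1252_MARKS = set(MARKS.encode("cp1252"))
MACROMAN_MARKS = set(MARKS.encode("mac_roman"))


def guess_single_byte(data: bytes) -> str:
    """Return 'ascii', 'cp1252', 'mac_roman' or 'unknown' for 8-bit text."""
    high = [b for b in data if b >= 0x80]
    if not high:
        return "ascii"
    cp_hits = sum(b in CP1252_MARKS for b in high)
    mac_hits = sum(b in MACROMAN_MARKS for b in high)
    if mac_hits > cp_hits:
        return "mac_roman"
    if cp_hits > mac_hits:
        return "cp1252"
    return "unknown"  # no typographic evidence either way


sample = "\u201cHello\u201d \u2013 it\u2019s fine\u2026"
print(guess_single_byte(sample.encode("mac_roman")))
print(guess_single_byte(sample.encode("cp1252")))
```

The obvious weakness is that each of these byte values is an accented letter 
in the other code page, so text heavy in accented letters can be misjudged; 
a real detector would weigh more evidence than this.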


My general feeling is that if you already have a process which works to 
detect the differences between MacRoman and Latin-1, then it is likely 
largely equivalent to any other means which exists (the accepted answer 
here 
 
sounds like it pretty much sums up the situation!) so beyond fixing the 
bug(s) you found recently, you might find that there is nothing more you 
can do.


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: Guessing the encoding of a test file...

2020-03-21 Thread peterwawood via use-livecode

Paul,

If it would help, I could make a very crude Rebol command line script to 
read a file, guess the encoding and print the encoding. It would give a crude 
way to compare the results against your current routine.

Rebol is easy to download and doesn't require installation, but it is 32-bit 
only, so the macOS version won't run on macOS Catalina.

Peter

PS Once again, sorry for the top posting.

 Original message 
From: Paul Dupuis via use-livecode
Date: 21/03/2020 22:11 (GMT+08:00)
To: use-livecode@lists.runrev.com
Cc: Paul Dupuis
Subject: Re: Guessing the encoding of a test file...

On 3/20/2020 8:49 PM, peterwawood via use-livecode wrote:
> Paul,
> I wrote a simple function to guess the encoding of a file, but in Rebol, not 
> LiveCode. I'm not sure how it compares with your current function in terms of 
> accuracy. It is being used by a company which does a lot of text processing. 
> (Though I don't know if that is a good recommendation or not.) The method I 
> used is explained in the brief documentation - 
> http://www.rebol.org/documentation.r?script=str-enc-utils.r. The rules could 
> be used to create a LiveCode function.
> Peter
> PS Sorry for top posting, I'm replying from a mobile app.

Peter,

Thank you. While I would have loved to see this as a LC script for 
comparison to my own routine, this is the sort of reply I was looking 
for: other routines I could compare mine to, seeking possible 
improvements in my own code.

Re: Guessing the encoding of a test file...

2020-03-21 Thread Paul Dupuis via use-livecode


Nope.

The reason I refer to the routine as "guessEncoding" is that I 
absolutely know that it is a "guess" based on the presence of nulls and 
other bytes for UTF files and on statistical sampling of various 
characters for MacRoman vs CP1252. We also offer an optional way for the 
user to pick the encoding IF THEY KNOW IT (or I suppose they can keep 
guessing until they get it right).
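
As a rough illustration of the "nulls and other bytes" part for UTF files, 
here is a minimal Python sketch. It is not the guessEncoding routine itself; 
the function name guess_unicode and the exact order of checks are 
assumptions. It looks for a byte order mark first, then uses the position of 
null bytes to guess UTF16 byte order, then falls back to a strict UTF8 
decode.

```python
import codecs

# UTF32 must be tested before UTF16: the UTF32-LE BOM starts with the
# UTF16-LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_unicode(data: bytes):
    """Return an encoding name, or None if the data looks like 8-bit text."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if b"\x00" in data:
        # Mostly-Latin UTF16 text has a null in every other byte:
        # at even offsets for big-endian, at odd offsets for little-endian.
        even_nulls = data[0::2].count(0)
        odd_nulls = data[1::2].count(0)
        return "utf-16-be" if even_nulls > odd_nulls else "utf-16-le"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # hand over to the MacRoman vs CP1252 sampling
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"


print(guess_unicode("Hi".encode("utf-16-be")))
print(guess_unicode("caf\u00e9".encode("cp1252")))
```

This sketch does not attempt BOM-less UTF32 and treats any null byte as 
evidence of UTF16, so it remains a guess in exactly the sense described 
above.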


I'll say it again: I was looking to see if ANYONE else had implemented 
a guessEncoding routine and was willing to share or license it for 
comparison to my own, in hopes of either concluding mine is the best it 
can be OR learning something someone else is doing that improves it a 
little bit.


So far the only person who has read my post and replied with what I was 
looking for was Peter - and although the routine was written in Rebol 
rather than LiveCode, he kindly provided a link to information about it.


On 3/21/2020 4:20 AM, Quentin Long via use-livecode wrote:

I strongly suspect that the desired goal, to have a nice, robust algorithm 
which automagically identifies the encoding of *ABSOLUTELY ANY* text document 
with zero need for human involvement, simply isn't possible. Because text 
encoding is intrinsically arbitrary—see also: the many variations on extended 
(8-bit) ASCII, the various mutually-incompatible versions of EBCDIC, etc ad 
nauseam.
Seems to me, therefore, that in the general case, human involvement is an 
*unavoidable necessity* in determining which encoding an arbitrary text 
document uses. So the goal of any encoding-ID algorithm should *not* be the 
impossible task of determining that encoding *without* human involvement. 
Rather, the goal should be to *minimize* that human involvement, make that 
human involvement as *simple and painless* as practically feasible. So, here 
goes with some semi-random rambling…  Pretty sure the best, most nearly 
bulletproof way to ID a document's text-encoding involves applying that 
encoding to the bits of the document, and showing the resulting 
character-sequence to a human. If there's more than one possibility for the 
document's encoding, apply all of the possible encodings, and show a human all 
of the resulting character-sequences. I'm thinking that a good way to do this 
might be to put up N different text fields in a window, with all of the text 
fields controlled by one scrollbar, and the human clicks on all of the fields 
whose content looks good to them. Or maybe the human clicks on all the fields 
whose output looks *bad* to them? Whichever way works; as long as there *is* 
some human judgement in there somewhere.
Can we assume that once a particular document's text-encoding has been identified, that 
*all* documents which came from the same source as that document use that particular 
encoding? If so, that might simplify the continuing workflow; tell the software 
"This document came from Source X", and the software then uses whichever 
text-encoding it associates with that source. Even if there's more than one such 
text-encoding in play, that's at least easier to work with than having to sort thru an 
arbitrarily large number of text-encodings.
Is it possible to tell the software "hey, no character in $ThisSetOfChars will ever 
appear in this document"? If so, the software should be able to rule out any 
encoding which ends up putting one of the Forbidden Chars into the decoded 
character-sequence.

Given human error, it may be that the human's input ends up ruling out *any 
possible* text-encoding. Prolly a good idea to use something akin to fuzzy 
logic rather than strict Boolean operations.


Re: Guessing the encoding of a test file...

2020-03-21 Thread Paul Dupuis via use-livecode


On 3/20/2020 8:49 PM, peterwawood via use-livecode wrote:

Paul,

I wrote a simple function to guess the encoding of a file, but in Rebol, not 
LiveCode. I'm not sure how it compares with your current function in terms of 
accuracy. It is being used by a company which does a lot of text processing. 
(Though I don't know if that is a good recommendation or not.) The method I 
used is explained in the brief documentation - 
http://www.rebol.org/documentation.r?script=str-enc-utils.r. The rules could 
be used to create a LiveCode function.

Peter

PS Sorry for top posting, I'm replying from a mobile app.

Peter,

Thank you. While I would have loved to see this as a LC script for 
comparison to my own routine, this is the sort of reply I was looking 
for: other routines I could compare mine to, seeking possible 
improvements in my own code.




Re: Guessing the encoding of a test file...

2020-03-21 Thread Quentin Long via use-livecode

I strongly suspect that the desired goal, to have a nice, robust algorithm 
which automagically identifies the encoding of *ABSOLUTELY ANY* text document 
with zero need for human involvement, simply isn't possible. Because text 
encoding is intrinsically arbitrary—see also: the many variations on extended 
(8-bit) ASCII, the various mutually-incompatible versions of EBCDIC, etc ad 
nauseam.
Seems to me, therefore, that in the general case, human involvement is an 
*unavoidable necessity* in determining which encoding an arbitrary text 
document uses. So the goal of any encoding-ID algorithm should *not* be the 
impossible task of determining that encoding *without* human involvement. 
Rather, the goal should be to *minimize* that human involvement, make that 
human involvement as *simple and painless* as practically feasible. So, here 
goes with some semi-random rambling…  Pretty sure the best, most nearly 
bulletproof way to ID a document's text-encoding involves applying that 
encoding to the bits of the document, and showing the resulting 
character-sequence to a human. If there's more than one possibility for the 
document's encoding, apply all of the possible encodings, and show a human all 
of the resulting character-sequences. I'm thinking that a good way to do this 
might be to put up N different text fields in a window, with all of the text 
fields controlled by one scrollbar, and the human clicks on all of the fields 
whose content looks good to them. Or maybe the human clicks on all the fields 
whose output looks *bad* to them? Whichever way works; as long as there *is* 
some human judgement in there somewhere.
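
The "apply all of the possible encodings and show a human the results" idea 
can be sketched in a few lines of Python. This is only an illustration: the 
function name candidate_previews, the default list of encodings and the 
60-character preview length are all assumptions, and a real tool would show 
the previews in side-by-side fields rather than print them.

```python
def candidate_previews(data, encodings=("utf-8", "cp1252", "mac_roman"), limit=60):
    """Decode the same bytes with each candidate encoding.

    Returns {encoding: preview}. Encodings that cannot decode the bytes
    at all are left out, which already narrows the human's choice.
    """
    previews = {}
    for enc in encodings:
        try:
            previews[enc] = data.decode(enc)[:limit]
        except UnicodeDecodeError:
            continue
    return previews


raw = "na\u00efve caf\u00e9".encode("mac_roman")
for enc, preview in candidate_previews(raw).items():
    print(f"{enc:10} {preview}")
```

In this example the bytes are not valid UTF8, so only the CP1252 and 
MacRoman readings are offered, and a person can see at a glance which one 
looks like real text.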
Can we assume that once a particular document's text-encoding has been 
identified, that *all* documents which came from the same source as that 
document use that particular encoding? If so, that might simplify the 
continuing workflow; tell the software "This document came from Source X", and 
the software then uses whichever text-encoding it associates with that source. 
Even if there's more than one such text-encoding in play, that's at least 
easier to work with than having to sort thru an arbitrarily large number of 
text-encodings.
Is it possible to tell the software "hey, no character in $ThisSetOfChars will 
ever appear in this document"? If so, the software should be able to rule out 
any encoding which ends up putting one of the Forbidden Chars into the decoded 
character-sequence.
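
A minimal Python sketch of that "Forbidden Chars" filter follows. The 
function name rule_out and its parameters are hypothetical; the strict 
filter shown here is the simple Boolean version, not a fuzzy one.

```python
def rule_out(data, encodings, forbidden):
    """Return the encodings whose decoding contains no forbidden character."""
    survivors = []
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # cannot be this encoding at all
        if not any(ch in forbidden for ch in text):
            survivors.append(enc)
    return survivors


raw = "na\u00efve caf\u00e9".encode("mac_roman")
# The user says bullets and Z-with-caron never occur in their documents.
print(rule_out(raw, ["utf-8", "cp1252", "mac_roman"], "\u2022\u017d"))
```

If the list comes back empty, the forbidden set was probably too strict for 
this document, which is one reason to prefer scoring over hard exclusion.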

Given human error, it may be that the human's input ends up ruling out *any 
possible* text-encoding. Prolly a good idea to use something akin to fuzzy 
logic rather than strict Boolean operations.


Re: Guessing the encoding of a test file...

2020-03-20 Thread peterwawood via use-livecode

Paul,

I wrote a simple function to guess the encoding of a file, but in Rebol, not 
LiveCode. I'm not sure how it compares with your current function in terms of 
accuracy. It is being used by a company which does a lot of text processing. 
(Though I don't know if that is a good recommendation or not.) The method I 
used is explained in the brief documentation - 
http://www.rebol.org/documentation.r?script=str-enc-utils.r. The rules could 
be used to create a LiveCode function.

Peter

PS Sorry for top posting, I'm replying from a mobile app.

 Original message 
From: Paul Dupuis via use-livecode
Date: 20/03/2020 23:35 (GMT+08:00)
To: use-livecode@lists.runrev.com
Cc: Paul Dupuis
Subject: Re: Guessing the encoding of a test file...

To Sean and Bob,

Thank you for your replies. I may not have been clear enough in my original 
post:

We make and sell an app for macOS and Windows. It's used around the world by 
researchers (not a lot of them, as it is a niche product) on their computers. 
The research application allows input of data from text files. Those text 
files come from the various sources those researchers have. It would 
negatively impact our competitiveness in our market if we forced the users to 
convert their data all to some specific text encoding, so we need to try to 
"guess" the encoding of those text files.

There are many published algorithms for doing this, and we had a past 
contractor of ours take a "best practice" algorithm and create a LCS 
"guessEncoding" function. This replaced a previous guessEncoding function we 
had from Richard Gaskin, which, while quite good, did not cover as many test 
cases as the newer, more robust one.

My main question to the list was: Has anyone out there ALSO written a 
guessEncoding function they might like to share or license?

Why did I ask this? Because I am interested in comparing the accuracy of our 
current handler to any other that may be available as, users being users, we 
recently had a user reveal a bug (a misnamed variable) in our current function 
that meant it was missing certain edge cases (and this user has hundreds of 
text files that need this edge case to be properly recognized as MacRoman 
encoding). So that bug has been fixed, but I am still interested in comparing 
any other guessEncoding routines to our current one to see if we can do 
better than we currently are.

To Mark,

As always, thanks for reading and responding, Mark. We're actually doing what 
you suggest. We had a set of QA test cases (text files in many different line 
endings and encodings), some intended to fail (such as Windows code pages we 
don't support). We're expanding these and doing a review on macOS and Windows 
with our app. Ones that fail, that we think shouldn't fail, we will step 
through the code to see why they fail and if our algorithm can be further 
enhanced. I can't foresee any algorithm tweaks we can't code ourselves that 
we'd need LC or USE-LIST assistance for.

Back around LiveCode 7, Fraiser said, in response to some correspondence I had 
with him, that he would consider creating a "guessEncoding" to go along with 
the Unicode Everywhere work and the new textEncode/textDecode functions. I do 
understand the reluctance, as a business, to do so, as inevitably there will 
be some instances where it guesses wrong. Other than LC adding a guessEncoding 
function using some open source library, I would say the area where LC could 
be the most help would be with this enhancement: 
https://quality.livecode.com/show_bug.cgi?id=22391

I am under the, perhaps false, impression that isoToMac and macToIso are sort 
of viewed as functions that may become deprecated and no longer updated in the 
future. However, they are still essential for us until I can 
textDecode(someData,"MacRoman") on a Windows system and vice versa.

Re: Guessing the encoding of a test file... [OT]

2020-03-20 Thread doc hawk via use-livecode



On Mar 20, 2020, at 4:04 PM, Mark Wieder via use-livecode 
 wrote:
> 
> Even Morse code got a new character recently.

But does livecode support that character?

:)

Re: Guessing the encoding of a test file... [OT]

2020-03-20 Thread Mark Wieder via use-livecode


On 3/20/20 1:47 PM, doc hawk via use-livecode wrote:


They created a *new* five bit, shifted code, rather than just using Baudot


Even Morse code got a new character recently.

--
 Mark Wieder
 ahsoftw...@gmail.com


RE: Guessing the encoding of a test file... [OT]

2020-03-20 Thread Ralph DiMola via use-livecode

It was essentially Baudot on the way in (some special char diffs) then
shifted to numeric on the way out. As I remember, the shifted numeric was not
any of the existing Baudot variants. Alpha stock symbols in and numeric
stock quotes out. If I remember correctly, the first 2 chars selected a head
on a magnetic drum and then waited for the last chars to match on that head.
Every user just had a new card plugged in. Response time was not user
dependent, whether 1 user or 1k users. A query just had to wait for rotational
latency.

Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net

-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf
Of doc hawk via use-livecode
Sent: Friday, March 20, 2020 4:48 PM
To: How to use LiveCode
Cc: doc hawk
Subject: Re: Guessing the encoding of a test file... [OT]

On Mar 20, 2020, at 12:51 PM, Ralph DiMola via use-livecode
 wrote:
> 
> Just for a laugh... one of the more esoteric codings I used in the quasi
> modern era (besides EBCDIC) was the 5 bit Quotron stock ticker system in
> the mid 90s. It used different codes for requesting/receiving quotes because
> 2^5 is only 32 possible characters. Alpha in/numeric out.

They created a *new* five bit, shifted code, rather than just using
Baudot


Re: Guessing the encoding of a test file... [OT]

2020-03-20 Thread Paul Dupuis via use-livecode


On 3/20/2020 4:47 PM, doc hawk via use-livecode wrote:

On Mar 20, 2020, at 12:51 PM, Ralph DiMola via use-livecode 
 wrote:

Just for a laugh... one of the more esoteric codings I used in the quasi modern 
era (besides EBCDIC) was the 5 bit Quotron stock ticker system in the mid 
90s. It used different codes for requesting/receiving quotes because 2^5 is 
only 32 possible characters. Alpha in/numeric out.

They created a *new* five bit, shifted code, rather than just using Baudot




From a guessEncoding perspective, you just scan the bytes and if all 
byte values are in the 0-31 range, you have a 5-bit code.


Then it get harder to determine whether it is Quotron or Baudot...



Re: Guessing the encoding of a test file... [OT]

2020-03-20 Thread doc hawk via use-livecode



On Mar 20, 2020, at 12:51 PM, Ralph DiMola via use-livecode 
 wrote:
> 
> Just for a laugh... one of the more esoteric codings I used in the quasi 
> modern era (besides EBCDIC) was the 5 bit Quotron stock ticker system in 
> the mid 90s. It used different codes for requesting/receiving quotes because 
> 2^5 is only 32 possible characters. Alpha in/numeric out.

They created a *new* five bit, shifted code, rather than just using Baudot



RE: Guessing the encoding of a test file... [OT]

2020-03-20 Thread Ralph DiMola via use-livecode

Just for a laugh... one of the more esoteric codings I used in the quasi modern 
era (besides EBCDIC) was the 5 bit Quotron stock ticker system in the mid 
90s. It used different codes for requesting/receiving quotes because 2^5 is 
only 32 possible characters. Alpha in/numeric out.

Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net

-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf Of 
doc hawk via use-livecode
Sent: Friday, March 20, 2020 2:13 PM
To: How to use LiveCode
Cc: doc hawk
Subject: Re: Guessing the encoding of a test file...

On Mar 20, 2020, at 11:09 AM, Paul Dupuis via use-livecode 
 wrote:
> 
> Okay, now you're going for the low blow :-)

What part of “lawyer” wasn’t clear?

:_)

> Next, you'll be suggesting I need to check for EBCDIC encodings!

That will be a start, but it’s not done until you include Baudot.

Morse, however, is optional . . .

Re: Guessing the encoding of a test file...

2020-03-20 Thread doc hawk via use-livecode

On Mar 20, 2020, at 11:09 AM, Paul Dupuis via use-livecode 
 wrote:
> 
> Okay, now you're going for the low blow :-)

What part of “lawyer” wasn’t clear?

:_)

> Next, you'll be suggesting I need to check for EBCDIC encodings!

That will be a start, but it’s not done until you include Baudot.

Morse, however, is optional . . .

Re: Guessing the encoding of a test file...

2020-03-20 Thread Paul Dupuis via use-livecode


On 3/20/2020 1:11 PM, doc hawk via use-livecode wrote:

On Mar 19, 2020, at 1:31 PM, Paul Dupuis via use-livecode 
 wrote:

“ASCII"

Wait, you’re not going to distinguish between six and seven bit ASCII?

:_)




Okay, now you're going for the low blow :-)

Next, you'll be suggesting I need to check for EBCDIC encodings!


Re: Guessing the encoding of a test file...

2020-03-20 Thread Paul Dupuis via use-livecode


On 3/20/2020 1:44 PM, Richard Gaskin via use-livecode wrote:
I would be interested to learn more about the details of the 
subsequent refinements over the decade since, but also the ROI 
proposition for today:


I'll try to remember to share the current code after this current 
review. I'm happy to put it out there for others who may need something. 
It adds a few more statistical samplings for MacRoman vs CP1252/Latin 1 
over your excellent original routine that catches a few more correct 
guesses.


As for the diminishing returns and ROI for today, I am not sure there is 
any sort of general ROI for further enhancing the current routine. It 
does just about every best practice for detection there is (to the best 
of my knowledge). That said, the current case is of a researcher with an 
edge variant who happens to be a long time customer AND has a *LOT* of 
text files that should come up as MacRoman but were not. With one more 
tweak (a tiny bug of a mistyped variable name) they now do detect correctly.


If the customer wasn't a long time customer and someone with lots of 
data with this problem, I probably would not invest this level of effort.



Re: Guessing the encoding of a test file...

2020-03-20 Thread Richard Gaskin via use-livecode

Paul Dupuis wrote:

> There are many published algorithms for doing this and we had a past
> contractor of ours take a "best practice" algorithm and create a LCS
> "guessEncoding" function. This replaced a previous guessEncoding
> function we had from Richard Gaskin, which while quite good, did
> not cover as many test cases as the newer more robust one.

The algo I wrote for you a decade ago was an amalgam of best efforts 
culled throughout this community at the time. It even included a 
variant, refined in our testing, of statistical analysis of certain 
patterns identified by Peter Haworth for files without explicit declaration.

At the time, running the algo through the test collection of some ~200 
widely varying sample documents, some of which even mixed different 
encodings, we compared our results with those from Apple's TextEdit and 
found that our algo correctly identified encoding at least 15% more 
often than TextEdit.

Once we bested Apple on that by an appreciable margin, all of us on the 
team reviewed the results and determined that we were clearly looking at 
a case of diminishing returns in terms of cost-to-further-refine vs 
actual percentage of documents in use requiring such refinement.

I would be interested to learn more about the details of the subsequent 
refinements over the decade since, but also the ROI proposition for today:

Given that another ten years has passed with modern encoding, and that 
older encodings like CP1252 (premiered in Windows 1.0 and popularized in 
Windows 95) are rarely seen in modern usage (as of March 2020 Wikipedia 
notes only 0.4% of web pages using that encoding), what percentage of 
documents your customers need to work with will benefit from further 
investment in refining that algo?

--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web

 ambassa...@fourthworld.com        http://www.FourthWorld.com


Re: Guessing the encoding of a test file...

2020-03-20 Thread Mark Waddingham via use-livecode


On 2020-03-20 15:34, Paul Dupuis via use-livecode wrote:

> Why did I ask this? Because I am interested in comparing the accuracy
> of our current handler to any other that may be available. Users
> being users, we recently had a user reveal a bug (a misnamed variable)
> in our current function that meant it was missing certain edge cases
> (and this user has hundreds of text files that need this edge case to
> be properly recognized as MacRoman encoding). That bug has been
> fixed, but I am still interested in comparing any other guessEncoding
> routines to our current one to see if we can do better than we currently
> are.


Perhaps:

https://pypi.org/project/chardet/

Sounds like it uses similar statistical (perhaps even an ML) model to 
detect charsets as Mozilla's 'UCD' (as mentioned by someone else in this 
thread).
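
For reference, a minimal Python sketch of calling chardet on raw bytes. It uses chardet's documented detect() function; the wrapper name detect_with_chardet is hypothetical, and the wrapper simply returns None when the package is not installed. As noted elsewhere in this thread, chardet at the time reportedly did not detect MacRoman, so this is only a point of comparison, not a drop-in answer.

```python
def detect_with_chardet(data: bytes):
    """Return (encoding, confidence) as reported by chardet.

    Returns None when the third-party chardet package is not installed
    (pip install chardet). The wrapper name is hypothetical.
    """
    try:
        import chardet
    except ImportError:
        return None
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    return result["encoding"], result["confidence"]


if __name__ == "__main__":
    sample = "Les élèves ont déjà reçu leurs résultats.".encode("utf-8")
    print(detect_with_chardet(sample))
```

The confidence value is chardet's own estimate; comparing it against a home-grown detector only makes sense over a labelled set of sample files.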



> As always, thanks for reading and responding, Mark. We're actually doing
> what you suggest. We had a set of QA test cases (text files in many
> different line endings and encodings), some intended to fail (such as
> Windows code pages we don't support). We're expanding these and doing
> a review on macOS and Windows with our app. For ones that fail that we
> think shouldn't, we will step through the code to see why they
> fail and whether our algorithm can be further enhanced. I can't foresee
> any algorithm tweaks we can't code ourselves that we'd need LC or
> USE-LIST assistance for.


My main reason for asking was to see if it seemed a reasonable 
assumption (to me, at least) that there would be any algorithm which 
would be able to determine the char encoding correctly. For example, 
MacRoman and Windows-1252 are very, very similar, so telling the 
difference would come with a reasonably high degree of error.



> Back around LiveCode 7, Fraser said, in response to some
> correspondence I had with him, that he would consider creating a
> "guessEncoding" to go along with the Unicode Everywhere work and the
> new textEncode/textDecode functions. I do understand the reluctance,
> as a business, to do so, as inevitably there will be some instances
> where it guesses wrong.


I can't recall exactly - but I think Fraser was thinking along the lines 
of being able to tell the difference between the utf-8, utf-16, utf-32 
and native encodings. That can be done with a high degree of confidence, 
and indeed is straightforward enough to code in LiveCode Script. (e.g. 
You can be almost 100% sure something is utf-8 if it roundtrips 
identically).
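
A minimal sketch of that high-confidence part, written in Python rather than LiveCode Script so it can be run as-is. The function name guess_unicode_encoding is hypothetical. It checks for a byte-order mark first, then pure ASCII, then a strict UTF-8 round trip; anything else is reported as None, meaning "some legacy 8-bit encoding, undetermined". It assumes BOM-less UTF-16/UTF-32 files do not need handling, which a real detector would have to address.

```python
import codecs

# The UTF-32LE mark begins with the UTF-16LE mark, so the longer
# UTF-32 marks must be tested before the UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_unicode_encoding(data: bytes):
    """Return an encoding name, or None if no Unicode encoding is evident."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    if all(byte < 0x80 for byte in data):
        return "ASCII"
    try:
        text = data.decode("utf-8")  # strict: rejects malformed sequences
    except UnicodeDecodeError:
        return None
    # Re-encoding confirms the round trip is byte-identical.
    return "UTF-8" if text.encode("utf-8") == data else None
```

Text in CP1252 or MacRoman that contains accented characters almost never forms valid UTF-8 by accident, which is why the round-trip test is so reliable.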


As I'm sure you are acutely aware, the difficult problem is telling the 
difference between very dense shift-sequence encodings (those which 
don't have some redundancy in their encodings to help with validation), 
and single-char encodings (e.g. between MacRoman and Latin-1). There is 
no algorithm for that per-se, just lots of heuristics (based on 
statistical models) and potential dictionary lookup to help distinguish 
edge cases. Implementing something such as that is no small endeavour...
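
To make the kind of heuristic concrete, here is a deliberately small Python sketch. It is not the algorithm used by anyone in this thread, and the names COMMON, score and guess_single_byte are hypothetical. It relies on two observations: five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in CP1252 but are ordinary letters in MacRoman, and lowercase accented letters sit at 0x87-0x9F in MacRoman but at 0xE0-0xFF in CP1252. The character set chosen for scoring is an assumption suited to Western European text only.

```python
# Characters assumed to be common in Western European running text:
# lowercase accented letters plus typographic punctuation.
COMMON = set("àáâäãåçèéêëìíîïñòóôöõøùúûüßæœ“”‘’–—…«»")


def score(text: str) -> float:
    """Fraction of non-ASCII characters that fall in the COMMON set."""
    high = [ch for ch in text if ord(ch) > 0x7F]
    if not high:
        return 0.0
    return sum(ch in COMMON for ch in high) / len(high)


def guess_single_byte(data: bytes):
    """Return 'MacRoman', 'CP1252', or None when the evidence is tied."""
    mac = score(data.decode("mac_roman"))  # every byte is defined in MacRoman
    try:
        win = score(data.decode("cp1252"))
    except UnicodeDecodeError:
        # The data contains a byte that CP1252 leaves undefined.
        return "MacRoman"
    if mac == win:
        return None
    return "MacRoman" if mac > win else "CP1252"
```

A production detector would weight characters by language-specific frequency and fall back to dictionary lookup, as described above; this sketch only shows why a guess is often possible and why ties and errors remain.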



> I am under the, perhaps false, impression that isoToMac and macToIso
> are sort of viewed as functions that may become deprecated and no
> longer updated in the future. However, they are still essential for us
> until I can textDecode(someData,"MacRoman") on a Windows system and
> vice versa.


They've not been deprecated yet so they aren't going anywhere - the 
internal functions those wrap are actually used to charset-swap strings 
in pre-v7 binary stackfiles (from v7, strings are serialized as utf-8 in 
stackfiles).


We probably will deprecate them when we make textDecode/Encode accept 
more encodings (as suggested in the enhancement request) - but only 
because the latter is a much neater way to do things... I believe the 
code you use at the moment gives results identical to what native 
textDecode/Encode support would give, doesn't it?


Warmest Regards,

Mark

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: Guessing the encoding of a test file...

2020-03-20 Thread doc hawk via use-livecode


On Mar 19, 2020, at 1:31 PM, Paul Dupuis via use-livecode 
 wrote:
> 
> “ASCII"

Wait, you’re not going to distinguish between six and seven bit ASCII?

:_)


Re: Guessing the encoding of a test file...

2020-03-20 Thread Håkan Liljegren via use-livecode

I know that Mozilla had a library for finding text decoding. I don’t think they 
use it anymore though. But I know it was translated into several other 
languages. It was called something like “universal character detection” or 
something equally sexy. Just typing out of my head, so it might be something 
completely different.

Håkan
On 20 Mar 2020, 00:47 +0100, Paul Dupuis via use-livecode 
, wrote:
> Users of our application may use text files in whatever encoding their
> local system creates them in. We cannot tell them to only create such
> files with a specific encoding. So, we need to detect the encoding of
> the text file the user selects.
>
> As I mentioned, I have an LC script that implements an encoding-guessing
> algorithm. I am looking for an alternative or better one if someone out
> there happens to have created one they might like to share or license.
>
> Any such routine needs to work on macOS and Windows and return the types
> used by the LC textDecode function.
>
> I already knew about file on OSX, but I need a cross-platform solution.
>
>

Re: Guessing the encoding of a test file...

2020-03-20 Thread Paul Dupuis via use-livecode


To Sean and Bob,

Thank you for your replies. I may not have been clear enough in my 
original post:


We make and sell an app for macOS and Windows. It's used around the 
world by researchers (not a lot of them, as it is a niche product) on 
their computers. The research application allows input of data from 
text files. Those text files come from the various sources those 
researchers have. It would negatively impact our competitiveness in 
our market if we forced the users to convert all their data to some 
specific text encoding, so we need to try to "guess" the encoding of 
those text files.


There are many published algorithms for doing this and we had a past 
contractor of ours take a "best practice" algorithm and create an LCS 
"guessEncoding" function. This replaced a previous guessEncoding function 
we had from Richard Gaskin, which, while quite good, did not cover 
as many test cases as the newer, more robust one.


My main question to the list was: Has anyone out there ALSO written a 
guessEncoding function they might like to share or license?


Why did I ask this? Because I am interested in comparing the accuracy of 
our current handler to any other that may be available. Users being 
users, we recently had a user reveal a bug (a misnamed variable) in our 
current function that meant it was missing certain edge cases (and this 
user has hundreds of text files that need this edge case to be properly 
recognized as MacRoman encoding). That bug has been fixed, but I am 
still interested in comparing any other guessEncoding routines to our 
current one to see if we can do better than we currently are.


To Mark,

As always, thanks for reading and responding, Mark. We're actually doing 
what you suggest. We had a set of QA test cases (text files in many 
different line endings and encodings), some intended to fail (such as 
Windows code pages we don't support). We're expanding these and doing a 
review on macOS and Windows with our app. For ones that fail that we 
think shouldn't, we will step through the code to see why they fail and 
whether our algorithm can be further enhanced. I can't foresee any algorithm 
tweaks we can't code ourselves that we'd need LC or USE-LIST assistance for.
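
A QA pass of this kind can be sketched as a small harness that runs a detector over labelled samples and reports the hit rate and the misses. The Python below is illustrative only: accuracy, failures, toy_detector and SAMPLES are hypothetical names, and the toy detector (UTF-8 if it decodes, otherwise CP1252) stands in for whatever guessEncoding routine is being reviewed.

```python
def accuracy(detector, samples):
    """Fraction of (data, expected) pairs for which detector(data) == expected."""
    if not samples:
        return 0.0
    hits = sum(1 for data, expected in samples if detector(data) == expected)
    return hits / len(samples)


def failures(detector, samples):
    """List of (expected, guessed) pairs for every sample the detector got wrong."""
    results = ((expected, detector(data)) for data, expected in samples)
    return [(expected, guessed) for expected, guessed in results if guessed != expected]


def toy_detector(data: bytes) -> str:
    """Stand-in detector: valid UTF-8 is 'UTF-8', everything else 'CP1252'."""
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "CP1252"


# In practice each entry would be the bytes of a test file and its known encoding.
SAMPLES = [
    ("é".encode("utf-8"), "UTF-8"),
    ("é".encode("cp1252"), "CP1252"),
    ("é".encode("mac_roman"), "MacRoman"),
]

if __name__ == "__main__":
    print(accuracy(toy_detector, SAMPLES))
    print(failures(toy_detector, SAMPLES))
```

Running two detectors over the same sample list gives a direct comparison, and the failures list points at the files worth stepping through in the debugger.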


Back around LiveCode 7, Fraser said, in response to some correspondence 
I had with him, that he would consider creating a "guessEncoding" to go 
along with the Unicode Everywhere work and the new textEncode/textDecode 
functions. I do understand the reluctance, as a business, to do so, as 
inevitably there will be some instances where it guesses wrong. Other 
than LC adding a guessEncoding function using some open source library, 
I would say the area where LC could be the most help would be with this 
enhancement https://quality.livecode.com/show_bug.cgi?id=22391


I am under the, perhaps false, impression that isoToMac and macToIso are 
sort of viewed as functions that may become deprecated and no longer 
updated in the future. However, they are still essential for us until I 
can textDecode(someData,"MacRoman") on a Windows system and vice versa.
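
To illustrate what such a platform-independent decode amounts to, here is a short Python sketch using Python's built-in mac_roman and cp1252 codecs. It is an analogy only: it does not claim that LiveCode's macToISO or isoToMac map every byte the same way, and the function name mac_roman_to_cp1252 is hypothetical.

```python
def mac_roman_to_cp1252(data: bytes) -> bytes:
    """Re-encode MacRoman bytes as CP1252.

    Characters that CP1252 cannot represent are replaced with '?'.
    """
    return data.decode("mac_roman").encode("cp1252", errors="replace")


if __name__ == "__main__":
    mac_bytes = b"caf\x8e"                    # "café" as stored by a MacRoman editor
    print(mac_bytes.decode("mac_roman"))      # decoded directly, on any platform
    print(mac_roman_to_cp1252(mac_bytes))     # the same text as CP1252 bytes
```

The byte 0x8E is "é" in MacRoman but "Ž" in CP1252, which is why decoding with the wrong table silently corrupts accented text instead of raising an error.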





Re: Guessing the encoding of a test file...

2020-03-20 Thread Bob Sneidar via use-livecode

If the files submitted to you do not need to retain their original formats for 
your purposes, why not just convert them all to a standard format? It's my 
understanding that if you open the file using low-level file commands without 
the binfile parameter, LC will convert the data into the local encoding. I 
might be mistaken. 

It would help to have a sample file for testing. 

Bob S


> On Mar 19, 2020, at 16:46 , Paul Dupuis via use-livecode 
>  wrote:
> 
> Users of our application may use text files in whatever encoding their local 
> system creates them in. We cannot tell them to only create such files with a 
> specific encoding. So, we need to detect the encoding of the text file the 
> user selects.
> 
> As I mentioned, I have an LC script that implements an encoding-guessing 
> algorithm. I am looking for an alternative or better one if someone out there 
> happens to have created one they might like to share or license.
> 
> Any such routine needs to work on macOS and Windows and return the types used 
> by the LC textDecode function.
> 
> I already knew about file on OSX, but I need a cross-platform solution.
> 
> 

Re: Guessing the encoding of a test file...

2020-03-20 Thread Mark Waddingham via use-livecode

Rather than throwing ‘the baby out with the bath water’ so to speak... What are 
the precise cases in which the method you have fails? And why do you expect it 
to work in those cases?

Warmest Regards,

Mark

Sent from my iPhone

> On 19 Mar 2020, at 20:32, Paul Dupuis via use-livecode 
>  wrote:
> 
> This has come up many times before, but I'll ask once again in case 
> something has changed or someone new sees this.
> 
> 
> Does anyone have a routine that will take a filespec to a text file and 
> return the guessed encoding of the text file?
> 
> 
> First, please don't respond that you should know the encoding or the users 
> should know the encoding of their files. Not possible in the widely 
> uncontrolled real world.
> 
> I do already have a routine to guess file encodings. It was written by 
> someone else. There are instances where it should work and does not. I fear 
> there may be errors in the algorithm and I do not have the original algorithm 
> to check it against. Hence, I am looking for an alternative that is either 
> free to use or to be licensed for a modest fee.
> 
> My current routine attempts to return the encoding as a string that can be 
> directly passed to textDecode(binaryData,encoding)
> 
> "ASCII"
> "UTF-16"
> "UTF-16BE"
> "UTF-16LE"
> "UTF-32"
> "UTF-32BE"
> "UTF-32LE"
> "UTF-8"
> "CP1252" *
> "MacRoman" *
> 
> * for these last 2, if the file is MacRoman on a Windows system, you actually 
> have to textDecode(macToISO(data),"CP1252") and if you have CP1252 on the 
> Mac, you need to do textDecode(isoToMac(data),"MacRoman"). There is an 
> enhancement request to support MacRoman decoding under Windows and vice versa 
> at https://quality.livecode.com/show_bug.cgi?id=22391 if you want to CC 
> yourself to show interest.
> 
> 

Re: Guessing the encoding of a test file...

2020-03-19 Thread Sean Cole (Pi) via use-livecode

You won't want to hear this but unfortunately for Windows you are out of
luck. Text files of themselves do not have the encoding embedded in them in
any form. Once it is written it is stored as a series of one or two byte
characters. If you open it as a binfile or a straight file it appears the
same. It is the lowest common denominator of storage formats. Text encoding
is one of those things that either has to be handled by a human or AI/ML.

All the best

Sean Cole
*Pi Digital *


On Thu, 19 Mar 2020 at 23:46, Paul Dupuis via use-livecode <
use-livecode@lists.runrev.com> wrote:

> Users of our application may use text files in whatever encoding their
> local system creates them in. We cannot tell them to only create such
> files with a specific encoding. So, we need to detect the encoding of
> the text file the user selects.
>
> As I mentioned, I have an LC script that implements an encoding-guessing
> algorithm. I am looking for an alternative or better one if someone out
> there happens to have created one they might like to share or license.
>
> Any such routine needs to work on macOS and Windows and return the types
> used by the LC textDecode function.
>
> I already knew about file on OSX, but I need a cross-platform solution.
>
>

Re: Guessing the encoding of a test file...

2020-03-19 Thread Paul Dupuis via use-livecode

Users of our application may use text files in whatever encoding their
local system creates them in. We cannot tell them to only create such
files with a specific encoding. So, we need to detect the encoding of
the text file the user selects.

As I mentioned, I have an LC script that implements an encoding-guessing
algorithm. I am looking for an alternative or better one if someone out
there happens to have created one they might like to share or license.

Any such routine needs to work on macOS and Windows and return the types
used by the LC textDecode function.

I already knew about file on OSX, but I need a cross-platform solution.

On 3/19/2020 6:15 PM, Pi Digital via use-livecode wrote:

> On a mac it’s easy. Use
> file -I "MyFile.txt"
> as a shell script.
>
> On Windows it’s near impossible without running a whole bunch of arbitrary
> tests that may or may not be correct - certainly not accurate.
>
> What kind of text were you hoping to see? Were you looking for a particular
> encoding? If it is grammatical text there are a bunch of runs you can do to
> see what character sets are used but even then it’s only a
> ‘probably’/‘possibly’ response.
>
> Sean Cole
> Pi Digital


Re: Guessing the encoding of a test file...

2020-03-19 Thread Pi Digital via use-livecode

On a mac it’s easy. Use 
file -I "MyFile.txt"
 as a shell script. 

On Windows it’s near impossible without running a whole bunch of arbitrary 
tests that may or may not be correct - certainly not accurate. 

What kind of text were you hoping to see? Were you looking for a particular 
encoding? If it is grammatical text there are a bunch of runs you can do to 
see what character sets are used but even then it’s only a 
‘probably’/‘possibly’ response. 

Sean Cole
Pi Digital 


> On 19 Mar 2020, at 20:31, Paul Dupuis via use-livecode 
>  wrote:
> 
> This has come up many times before, but I'll ask once again in case 
> something has changed or someone new sees this.
> 
> 
> Does anyone have a routine that will take a filespec to a text file and 
> return the guessed encoding of the text file?
> 
> 
> First, please don't respond that you should know the encoding or the users 
> should know the encoding of their files. Not possible in the widely 
> uncontrolled real world.
> 
> I do already have a routine to guess file encodings. It was written by 
> someone else. There are instances where it should work and does not. I fear 
> there may be errors in the algorithm and I do not have the original algorithm 
> to check it against. Hence, I am looking for an alternative that is either 
> free to use or to be licensed for a modest fee.
> 
> My current routine attempts to return the encoding as a string that can be 
> directly passed to textDecode(binaryData,encoding)
> 
> "ASCII"
> "UTF-16"
> "UTF-16BE"
> "UTF-16LE"
> "UTF-32"
> "UTF-32BE"
> "UTF-32LE"
> "UTF-8"
> "CP1252" *
> "MacRoman" *
> 
> * for these last 2, if the file is MacRoman on a Windows system, you actually 
> have to textDecode(macToISO(data),"CP1252") and if you have CP1252 on the 
> Mac, you need to do textDecode(isoToMac(data),"MacRoman"). There is an 
> enhancement request to support MacRoman decoding under Windows and vice versa 
> at https://quality.livecode.com/show_bug.cgi?id=22391 if you want to CC 
> yourself to show interest.
> 
> 
