Re: Guess encoding for text file...

2019-09-19 Thread JJS via use-livecode
Ah, that reminds me of the Amiga: if copying floppies did not go well, 
we used nibble-copy.



BTW, your surname differs from mine by 2 bytes in ASCII, with the same 
first name (off topic, this is) :)


On 19-9-2019 at 21:23, Jerry Jensen via use-livecode wrote:

On Sep 19, 2019, at 11:53 AM, Dar Scott Consulting via use-livecode 
 wrote:

Yeah. I love the smell of burning bytes.

4 bits is called a nybble, and
2 bits is called a snyf.


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode




Re: Guess encoding for text file...

2019-09-19 Thread Dar Scott Consulting via use-livecode
And I thought 2 bits was a quarter. 

> On Sep 19, 2019, at 1:23 PM, Jerry Jensen via use-livecode 
>  wrote:
> 
> On Sep 19, 2019, at 11:53 AM, Dar Scott Consulting via use-livecode 
>  wrote:
>> 
>> Yeah. I love the smell of burning bytes.
> 
> 4 bits is called a nybble, and
> 2 bits is called a snyf.
> 
> 




Re: Guess encoding for text file...

2019-09-19 Thread Jerry Jensen via use-livecode
On Sep 19, 2019, at 11:53 AM, Dar Scott Consulting via use-livecode 
 wrote:
> 
> Yeah. I love the smell of burning bytes.

4 bits is called a nybble, and
2 bits is called a snyf.




Re: Guess encoding for text file...

2019-09-19 Thread Klaus major-k via use-livecode



> On 19.09.2019 at 20:53, Dar Scott Consulting via use-livecode wrote:
> 
> Yeah. I love the smell of burning bytes...

... in the morning. :-)

> ...
>> 'Cuz I don't even plan to use a loop if it ain't strictly called for
>> What's that smell? Oh yeah, burning bytes. :)

--
Klaus Major
https://www.major-k.de
kl...@major-k.de




Re: Guess encoding for text file...

2019-09-19 Thread Dar Scott Consulting via use-livecode
Yeah. I love the smell of burning bytes.

> On Sep 19, 2019, at 12:19 PM, Curry Kenworthy via use-livecode 
>  wrote:
> 
> 
> 'Cuz I don't even plan to use a loop if it ain't strictly called for
> 
> What's that smell? Oh yeah, burning bytes. :)
> 
> Best wishes,
> 
> Curry Kenworthy
> 
> Custom Software Development
> "Better Methods, Better Results"
> LiveCode Training and Consulting
> http://livecodeconsulting.com/
> 


Re: Guess encoding for text file...

2019-09-19 Thread Curry Kenworthy via use-livecode



'Cuz I don't even plan to use a loop if it ain't strictly called for

What's that smell? Oh yeah, burning bytes. :)

Best wishes,

Curry Kenworthy

Custom Software Development
"Better Methods, Better Results"
LiveCode Training and Consulting
http://livecodeconsulting.com/



Re: Guess encoding for text file...

2019-09-19 Thread dsc--- via use-livecode
I thought of a quick way to do a first pass and it can almost fit in the margin.


Re: Guess encoding for text file...

2019-09-19 Thread Dar Scott Consulting via use-livecode
UTF-16 and UTF-32 are not needed in your list. Those are BE unless indicated 
otherwise by a leading BOM. That is, the BE and LE versions are sufficient. 

ASCII encoding is a subset of CP1252, MacRoman and UTF-8, so that can be 
classified as UTF-8 if there is no advantage to knowing that it is ASCII. 
(Printable ASCII is a subset of ISO-8859-1). 
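
To illustrate the subset point, here is a minimal Python sketch (not LiveCode; the helper name is_ascii is hypothetical). It checks that every byte is below 0x80 and shows that such data decodes to the same text under several of the listed encodings, using Python's own codec names for them.

```python
def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range (0x00-0x7F)."""
    return all(b < 0x80 for b in data)

sample = b"plain ASCII text\n"

# Python codec names standing in for UTF-8, CP1252, MacRoman and ISO-8859-1.
decoded = {
    enc: sample.decode(enc)
    for enc in ("utf-8", "cp1252", "mac_roman", "latin-1")
}
```

If is_ascii() holds for the whole file, the choice among these encodings does not change the decoded text, so reporting UTF-8 is harmless.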

A couple of thoughts on creating a custom function. Your special codes in ASCII 
files of 1, 2, 3 and 4 can be considered in a custom function. You might get a 
good idea from just 128 bytes, or maybe a few iterations of 32 bytes. You can 
consider an a priori ordering of likelihood, related to the question of which 
tests provide the most information in the least time. And if you can't tell the 
difference, then maybe it doesn't matter. 

I considered some methods of adjusting probabilities but the overhead means the 
test chunks should not be trivial. Also, the probability might be simplified to 
"maybe" and "nope". (However, if there might be errors in the text or 
discernment needs to rely on text probabilities, the numbers might be best.)  
Tests move probabilities from maybe to nope.
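
The "maybe"/"nope" idea can be sketched roughly as follows. This is Python rather than LiveCode, and the names CANDIDATES and eliminate are hypothetical. The only test used here is "does a strict decode fail?", which is just one possible elimination test, not the routine discussed in this thread.

```python
# Map the encoding names from the list to Python codec names.
CANDIDATES = {
    "UTF-8": "utf-8",
    "UTF-16BE": "utf-16-be",
    "UTF-16LE": "utf-16-le",
    "UTF-32BE": "utf-32-be",
    "UTF-32LE": "utf-32-le",
    "CP1252": "cp1252",
    "MacRoman": "mac_roman",
}

def eliminate(chunk: bytes, candidates=CANDIDATES) -> list:
    """Return the names still at 'maybe' after trying a strict decode of chunk."""
    maybe = []
    for name, codec in candidates.items():
        try:
            chunk.decode(codec)  # strict error handling is the default
        except UnicodeDecodeError:
            continue  # this candidate moves from "maybe" to "nope"
        maybe.append(name)
    return maybe
```

One caveat with this approach: a chunk cut in the middle of a multi-byte character would wrongly eliminate the right answer, so chunk boundaries need some care. Single-byte encodings such as MacRoman accept nearly any byte sequence, so they tend to survive and need a different kind of test.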

One method might do a batch of unsigned 32-bit int decodes and do logic 
operations on each of those. That can only do partial elimination tests on 
UTF-8, but detailed tests can be done afterward. I am not sure about 
performance; it might be that byteToNum() would be much faster.

I'm guessing that one can get some good probabilities from the first four bytes.
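
One thing the first four bytes can show directly is a byte order mark. A minimal Python sketch (sniff_bom is a hypothetical name; the BOM constants come from Python's standard codecs module):

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)

def sniff_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

FF FE 00 00 could in principle also be a UTF-16LE BOM followed by a NUL character, so even this test is a strong hint rather than a proof. Files without a BOM still need the other tests.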

So, I agree with Curry. He might not use anything I mentioned, but he can 
optimize your code for longer files, if you need full checking.


Re: Guess encoding for text file...

2019-09-18 Thread Paul Dupuis via use-livecode
I am sure my routine could be optimized some for performance. My 
consideration of doing this via OSX/Windows API using LCB FFI is only 
partially about performance. I think an advantage is that the OS vendors 
(with way more resources than me) are more likely to keep the algorithms 
up to date with best practices (I know some of you will laugh at 
that).  Additionally, I would hope that, at some point, LiveCode 
corporate might consider taking it over if it were a collaborative open 
source LCB library rather than an LCS script.









Re: Guess encoding for text file...

2019-09-17 Thread Curry Kenworthy via use-livecode



Paul:

> I have a LiveCode Script (LCS) routine that attempts to
> follow industry common algorithms for guessing the encoding
> of a text file.
> Its performance can be slower than I would like.

Howdy,

Even though LC 9 is exceedingly slow on some operations -(cough, cough, 
ahem, that's a topic in itself)- HOWEVER, I believe it's quite capable 
of satisfactory performance in this particular area without needing LCB, 
much less C.


Pretty sure I can optimize your legacy routine to work just fine. It 
simply wasn't designed for huge files, nor for LC 9's slow-mo loop 
speed, but that could be easily remedied with a few tweaks!


Best wishes,

Curry Kenworthy

Custom Software Development
"Better Methods, Better Results"
LiveCode Training and Consulting
http://livecodeconsulting.com/



Guess encoding for text file...

2019-09-17 Thread Paul Dupuis via use-livecode
I started this post on the DEV-LIST. Mark Waddingham kindly responded 
and smartly suggested I should move it to the USE-LIST, so that is what 
I am doing. I have also pasted Mark's reply below my original post.


-- ORIGINAL POST 



I have a LiveCode Script (LCS) routine that attempts to follow industry 
common algorithms for guessing the encoding of a text file.


Its performance can be slower than I would like.

This has led me to wonder if a LiveCode Builder (LCB) library may be the 
route to go. Does anyone know the OSX and/or Windows APIs for guessing a 
text file's encoding?


I have done a number of google searches, but I am not a C programmer 
(not in many decades) and wading through the huge doc sets at MSDN or 
Apple is daunting.


I found reference to a windows API:

BOOL IsTextUnicode( const VOID *lpv, int iSize, LPINT lpiResult );

Which suggests to me that such APIs may exist. Does anyone who is 
better at finding OS APIs know where to find such APIs? Can you point me 
to the right online documentation?


I also found this: 
https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding


Of course, it would be wonderful if the mothership delivered this. At 
one point, back around LC 7, Frasier said he would do something like this.


https://quality.livecode.com/show_bug.cgi?id=14474

It seems an LCB library that uses OS APIs to return a best guess for file 
encoding, using the encoding names accepted by the textEncode/textDecode 
functions, would be a great addition to LC:


 * "ASCII"
 * "UTF-16"
 * "UTF-16BE"
 * "UTF-16LE"
 * "UTF-32"
 * "UTF-32BE"
 * "UTF-32LE"
 * "UTF-8"
 * "CP1252"
 * "ISO-8859-1"
 * "MacRoman"

and I suppose "Binary" as the default if none of the above can be detected

- MARK'S REPLY 
On 2019-09-13 16:44, Paul Dupuis wrote:
> I have a LiveCode Script (LCS) routine that attempts to follow
> industry common algorithms for guessing the encoding of a text file.
>
> Its performance can be slower than I would like.

If you share your code perhaps we can help speed it up...

> This has led me to wonder if a LiveCode Builder (LCB) library may be
> the route to go. Does anyone know the OSX and/or Windows APIs for
> guessing a text file's encoding?
>
> I have done a number of google searches, but I am not a C programmer
> (not in many decades) and wading through the huge doc sets at MSDN or
> Apple is daunting.
>
> I found reference to a windows API:
>
> BOOL IsTextUnicode(
>   const VOID *lpv,
>   int    iSize,
>   LPINT  lpiResult
> );
>
> Which suggests to me that such APIs may exist. Does anyone who is
> better at finding OS APIs know where to find such APIs? Can you point
> me to the right online documentation?

Libraries certainly exist: Mozilla has a 'universal charset detector 
library' for example, which appears to use various statistical 
heuristics to tell between all kinds of encodings.


The 'IsTextUnicode' API seems to just tell you whether a sequence of 
bytes is likely to be UTF-16 or not UTF-16; so probably won't be all 
that helpful if that isn't all you are wanting to distinguish between.


Do you have a list of encodings you are needing to guess between? That 
will generally influence how fast (and accurate) you can make such a 
function (it's almost trivial to detect UTF-8 with a high degree of 
confidence, UTF-32 I think as well, UTF-16 is somewhat harder, and 
distinguishing between single-byte and legacy multi-byte charsets is, 
relatively speaking, very hard).
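
As a rough illustration of why UTF-8 is the easy case: its lead and continuation bytes follow a rigid bit pattern that text in a single-byte encoding rarely matches by accident. The Python sketch below (utf8_structure_ok is a hypothetical name) checks only that pattern. It is deliberately simplified and does not reject overlong forms or surrogate code points, which a full validator would.

```python
def utf8_structure_ok(data: bytes) -> bool:
    """Check that lead bytes are followed by the right number of 10xxxxxx bytes."""
    remaining = 0  # continuation bytes still expected for the current character
    for b in data:
        if remaining:
            if (b & 0xC0) != 0x80:   # not a continuation byte
                return False
            remaining -= 1
        elif b < 0x80:               # plain ASCII byte
            continue
        elif (b & 0xE0) == 0xC0:     # 110xxxxx: 2-byte sequence
            remaining = 1
        elif (b & 0xF0) == 0xE0:     # 1110xxxx: 3-byte sequence
            remaining = 2
        elif (b & 0xF8) == 0xF0:     # 11110xxx: 4-byte sequence
            remaining = 3
        else:                        # stray continuation or invalid lead byte
            return False
    return remaining == 0            # no character may be left unfinished
```

Pure ASCII passes this check too, so a True result on data with no bytes of 0x80 or above says nothing beyond "ASCII-compatible".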


Warmest Regards,

Mark.

P.S. This might be a better discussion to have on the use-list unless 
there is a reason not to, it might be of interest to others in that 
wider group.


--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

___
livecode-dev mailing list
livecode-...@lists.runrev.com
http://lists.runrev.com/mailman/listinfo/livecode-dev
