Re: Distinguishing between ASCII and UTF8

2010-10-07 Thread Richmond Mathewson

On 10/7/10 9:39 PM, Jerry J wrote:

> On Oct 7, 2010, at 11:05 AM, Lynn Fredricks wrote:
>
>> I still have sweaty nightmares about DOS code pages...
>
> I whisper quietly to myself in a corner: "EBCDIC".
> --Jerry Jensen


The thing that wakes me in a cold sweat at the Brahma Muhurta
is the FORTRAN "Format".

Richmond
___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution


Re: Distinguishing between ASCII and UTF8

2010-10-07 Thread Jerry J
On Oct 7, 2010, at 11:05 AM, Lynn Fredricks wrote:

> I still have sweaty nightmares about DOS code pages...

I whisper quietly to myself in a corner: "EBCDIC".
--Jerry Jensen




Re: Distinguishing between ASCII and UTF8

2010-10-07 Thread Mark Schonewille

Hi Bob,

UTF8 is platform independent; ASCII, once you go beyond the 7-bit range (0-127), isn't.
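
To make the distinction concrete, here is a minimal sketch in Python (not LiveCode, and purely illustrative; the sample strings and the choice of encodings are my own assumptions). Pure 7-bit text gives identical bytes under ASCII and UTF-8, while a character such as the dagger Richard mentions gets different bytes under each encoding:

```python
# Illustrative only: the sample text and the encodings compared are assumptions.
plain = "Hello, world"
dagger = "\u2020"  # DAGGER, the "high-ASCII" character mentioned in the thread

# 7-bit text: ASCII and UTF-8 produce exactly the same bytes.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# Outside 0-127 the byte values depend on the encoding in use.
dagger_bytes = {
    "utf-8": dagger.encode("utf-8"),
    "mac_roman": dagger.encode("mac_roman"),
    "cp1252": dagger.encode("cp1252"),
}

print(same_bytes)
for name, raw in dagger_bytes.items():
    print(name, raw.hex())
```

So a file containing only characters 0-127 needs no conversion at all; the conversion only matters for whatever sits above that range.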

--

Economy-x-Talk
Consultancy and Software Engineering
http://economy-x-talk.com
http://www.salery.biz

Get your store on-line within minutes with Salery Web Store software.  
Download at http://www.salery.biz


On 7 Oct 2010, at 18:59, Bob Sneidar wrote:

> Okay, so that begs the question, if there is no difference between
> UTF8 and ASCII, why make the distinction? I mean, what would be the
> point to converting from ASCII to UTF8 or vice versa if the results
> were always the same?
>
> Just being practical.
>
> Bob




RE: Distinguishing between ASCII and UTF8

2010-10-07 Thread Lynn Fredricks
> On 10/7/10 7:59 PM, Bob Sneidar wrote:
> > Okay, so that begs the question, if there is no difference
> > between UTF8 and ASCII, why make the distinction? I mean,
> > what would be the point to converting from ASCII to UTF8 or
> > vice versa if the results were always the same?
> >
> > Just being practical.

UTF8 is (at a minimum) what you want to internationalize your applications.
You can display and manage most of the world's languages with UTF8, though I
am more partial to UTF16 because UTF8 has some limitations when it comes to
searching/sorting with Chinese characters. Today's operating systems pretty
much use UTF16 internally, and text may or may not get slapped down to UTF8.

There used to be ASCII and extended ASCII, though I guess both are simply
called ASCII now.

We use UTF16 internally with Valentina, and in cases where the client cannot
handle it, it gets transformed so it's useful.

Valentina was chosen years ago by Nikon Corporation for Picture Project, a
piece of software they shipped worldwide with their digital cameras, because
our Unicode support was so good - it made shipping in so many languages easy
for them.

I still have sweaty nightmares about DOS code pages...

Best regards,

Lynn Fredricks
President
Paradigma Software
http://www.paradigmasoft.com

Valentina SQL Server: The Ultra-fast, Royalty Free Database Server 



Re: Distinguishing between ASCII and UTF8

2010-10-07 Thread Richmond Mathewson

On 10/7/10 8:02 PM, Bob Sneidar wrote:

> I have a saying: You know exactly as much after you say "Maybe..." as you did
> before you said it.

I always wonder about the word 'Maybe' and whether it might be almost
semantically empty . . .  :)

> Bob
>
> On Oct 6, 2010, at 4:55 PM, Richard Gaskin wrote:
>
>> Jeff, Dave, Peter:  thank you!
>>
>> Good stuff - I think I'll be able to distinguish most files using those.
>>
>> --
>> Richard Gaskin
>> Fourth World
>> LiveCode training and consulting: http://www.fourthworld.com
>> Webzine for LiveCode developers: http://www.LiveCodeJournal.com
>> LiveCode Journal blog: http://LiveCodejournal.com/blog.irv


Re: Distinguishing between ASCII and UTF8

2010-10-07 Thread Richmond Mathewson

On 10/7/10 7:59 PM, Bob Sneidar wrote:

> Okay, so that begs the question, if there is no difference between UTF8 and
> ASCII, why make the distinction? I mean, what would be the point to converting
> from ASCII to UTF8 or vice versa if the results were always the same?
>
> Just being practical.

Some of us grew up in Britain in the 60s and 70s (Oh, how depressing) and
remember the feeling of moving from short trousers to long trousers; as far
as I understand ASCII and UTF8 are somehow the same without the place being
trashed by the . . . . . (whoops, no politics) . . . those of you who want to
understand my reference should watch "Carry On At Your Convenience"; a light,
easily digestible introduction to the politics of the early 70s.

> Bob
>
> On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:
>
>> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
>> wrote:
>>
>>> I have an app that needs to auto-detect Unicode and plain text, and render
>>> them correctly based on that auto-detection.
>>>
>>> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is
>>> no BOM to let me know if it's Unicode, and some plain text files will
>>> occasionally have high-ASCII values in them (like the dagger symbol).
>>>
>>> What patterns should I be looking for in the binary data of a file to
>>> distinguish UTF8 from plain text?
>>
>> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
>> is that, for bytes 0-127, it's indistinguishable from ASCII. You may be able
>> to scan the files and, if they are large enough, try to deduce something from
>> them to know which they are. For example:
>>
>> On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
>> text file.
>>
>> In ASCII there will never be a NULL terminator anywhere (byte 0). There are
>> likely to be many 0-byte values in any appreciably large UTF16 file (UTF8
>> text, like ASCII, should contain none). This would also be true of byte 8
>> (backspace) and byte 7 (the bell) and probably a few others.
>>
>> If the number of bytes that have the high bit (0x80) set is extremely low
>> (<<< 1%) then most likely it's ASCII.
>>
>> HTH,
>>
>> Jeff M.


Re: Distinguishing between ASCII and UTF8

2010-10-07 Thread Bob Sneidar
I have a saying: You know exactly as much after you say "Maybe..." as you did 
before you said it. 

Bob


On Oct 6, 2010, at 4:55 PM, Richard Gaskin wrote:

> Jeff, Dave, Peter:  thank you!
> 
> Good stuff - I think I'll be able to distinguish most files using those.
> 
> --
> Richard Gaskin
> Fourth World
> LiveCode training and consulting: http://www.fourthworld.com
> Webzine for LiveCode developers: http://www.LiveCodeJournal.com
> LiveCode Journal blog: http://LiveCodejournal.com/blog.irv


Re: Distinguishing between ASCII and UTF8

2010-10-07 Thread Bob Sneidar
Okay, so that begs the question, if there is no difference between UTF8 and
ASCII, why make the distinction? I mean, what would be the point to converting
from ASCII to UTF8 or vice versa if the results were always the same?

Just being practical. 

Bob


On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:

> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
> wrote:
> 
>> I have an app that needs to auto-detect Unicode and plain text, and render
>> them correctly based on that auto-detection.
>> 
>> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is
>> no BOM to let me know if it's Unicode, and some plain text files will
>> occasionally have high-ASCII values in them (like the dagger symbol).
>> 
>> What patterns should I be looking for in the binary data of a file to
>> distinguish UTF8 from plain text?
>> 
>> 
> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
> is that, for bytes 0-127, it's indistinguishable from ASCII. You may be able
> to scan the files and, if they are large enough, try to deduce something from
> them to know which they are. For example:
>
> On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
> text file.
>
> In ASCII there will never be a NULL terminator anywhere (byte 0). There are
> likely to be many 0-byte values in any appreciably large UTF16 file (UTF8
> text, like ASCII, should contain none). This would also be true of byte 8
> (backspace) and byte 7 (the bell) and probably a few others.
>
> If the number of bytes that have the high bit (0x80) set is extremely low
> (<<< 1%) then most likely it's ASCII.
> 
> HTH,
> 
> Jeff M.


Re: Distinguishing between ASCII and UTF8

2010-10-06 Thread Richard Gaskin

Jeff, Dave, Peter:  thank you!

Good stuff - I think I'll be able to distinguish most files using those.

--
 Richard Gaskin
 Fourth World
 LiveCode training and consulting: http://www.fourthworld.com
 Webzine for LiveCode developers: http://www.LiveCodeJournal.com
 LiveCode Journal blog: http://LiveCodejournal.com/blog.irv


Re: Distinguishing between ASCII and UTF8

2010-10-06 Thread Peter W A Wood
Richard

> I have an app that needs to auto-detect Unicode and plain text, and render 
> them correctly based on that auto-detection.
> 
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is no 
> BOM to let me know if it's Unicode, and some plain text files will 
> occasionally have high-ASCII values in them (like the dagger symbol).
> 
> What patterns should I be looking for in the binary data of a file to 
> distinguish UTF8 from plain text?

These are the "Rules of Thumb" that I have used to try to determine the
encoding type of text files. I feel that I achieved more than 90 per cent
success, but that may be because most of the files only included true ASCII
characters (0-127). The script only tries to distinguish between ASCII, UTF-8,
MacRoman and Windows 1252 Codepage (the US default for Windows).

Rules of Thumb, applied in the following order:

1. If the string starts with a BOM, the encoding inferred from the BOM will be
returned.

2. If the string contains only characters in the range 0x00 - 0x7F, it is an
ASCII string.

3. If the string contains more UTF-8 multi-byte characters than it does invalid
UTF-8 characters and invalid multi-byte sequences, it is a UTF-8 string.

4. If the string contains characters in the range 0xA0 - 0xFF but none in the
range 0x80 - 0x9F, it is an ISO-8859-1 string.

5. If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D, it is a
MacRoman string.

6. If the string contains carriage returns but no line feeds, it is a MacRoman
string.

7. It is a Windows 1252 Codepage string.

The approach I take in the script is to count the different types of characters
in the text and then apply the rules of thumb. The script is written in REBOL
so will probably not even be of help as a guide. However, the documentation
includes a table of the differences between UTF-8, Windows 1252 and MacRoman
which you may find useful. You can find it at
http://www.rebol.org/documentation.r?script=str-enc-utils.r
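
For anyone who would rather see the rules as code, here is a rough sketch in Python. It is not the REBOL script and makes no claim to match it; the names `guess_encoding` and `count_utf8` are hypothetical, and the way "valid" and "invalid" UTF-8 sequences are counted is my own assumption about what rule 3 means.

```python
import codecs

# Longer BOMs first so that UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def count_utf8(data: bytes):
    """Return (valid multi-byte sequences, invalid high bytes). Assumed counting scheme."""
    good = bad = i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII byte, not counted either way
            i += 1
            continue
        # Expected sequence length from the lead byte (0 means "not a lead byte").
        n = 2 if 0xC2 <= b <= 0xDF else 3 if 0xE0 <= b <= 0xEF else 4 if 0xF0 <= b <= 0xF4 else 0
        chunk = data[i:i + n]
        try:
            ok = n > 0 and len(chunk.decode("utf-8")) == 1
        except UnicodeDecodeError:
            ok = False
        if ok:
            good += 1
            i += n
        else:
            bad += 1
            i += 1
    return good, bad

def guess_encoding(data: bytes) -> str:
    for bom, name in BOMS:                            # rule 1
        if data.startswith(bom):
            return name
    high = {b for b in data if b >= 0x80}
    if not high:                                      # rule 2
        return "ascii"
    good, bad = count_utf8(data)
    if good > bad:                                    # rule 3
        return "utf-8"
    if not any(0x80 <= b <= 0x9F for b in high):      # rule 4
        return "iso-8859-1"
    if high & {0x81, 0x8D, 0x8F, 0x90, 0x9D}:         # rule 5
        return "mac_roman"
    if b"\r" in data and b"\n" not in data:           # rule 6
        return "mac_roman"
    return "cp1252"                                   # rule 7
```

For example, `guess_encoding(b"hello")` returns `"ascii"`, while the cp1252 bytes for a dagger followed by plain text fall through to rule 7. As with the rules themselves, this is a guess, not a guarantee.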

Regards

Peter





Re: Distinguishing between ASCII and UTF8

2010-10-06 Thread Dave Cragg
Richard

Below is a function that was translated from a PHP script. It is intended to
determine whether the passed-in string "could be" utf8. I have tested it in a
limited way and it seems to work. But maybe someone else can see the flaws.

If it returns false, then it is not UTF8. If it returns true, it fits the
pattern of utf8, but it could be something else like some random binary.

If it doesn't work, you could perhaps use it to scare children.

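function couldBeUtf8 pString
   -- Build a PCRE pattern that accepts only well-formed UTF-8 byte sequences
   put "(?is)^([\x09\x0A\x0D\x20-\x7E]" into tRE          -- tab, LF, CR, printable ASCII
   put "|[\xC2-\xDF][\x80-\xBF]" after tRE                -- 2-byte sequences (no overlongs)
   put "|\xE0[\xA0-\xBF][\x80-\xBF]" after tRE            -- 3-byte, excluding overlongs
   put "|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" after tRE     -- straight 3-byte
   put "|\xED[\x80-\x9F][\x80-\xBF]" after tRE            -- 3-byte, excluding surrogates
   put "|\xF0[\x90-\xBF][\x80-\xBF]{2}" after tRE         -- 4-byte, planes 1-3
   put "|[\xF1-\xF3][\x80-\xBF]{3}" after tRE             -- 4-byte, planes 4-15
   put "|\xF4[\x80-\x8F][\x80-\xBF]{2})*$" after tRE      -- 4-byte, plane 16
   
   -- true only if the whole string consists of the sequences above
   return matchText(pString, tRE)
end couldBeUtf8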
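
For comparison, the same pattern can be sketched in Python as a bytes regex. This is only an illustration under my own assumptions (the name `could_be_utf8` is made up, and it is not the PHP original); it may be handy for checking sample files outside LiveCode.

```python
import re

# Same alternatives as the LiveCode pattern above, written as a bytes regex.
_UTF8_RE = re.compile(
    rb"\A(?:"
    rb"[\x09\x0A\x0D\x20-\x7E]"             # tab, LF, CR, printable ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2-byte sequences, no overlongs
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3-byte, excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # straight 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3-byte, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4-byte, planes 1-3
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4-byte, planes 4-15
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"      # 4-byte, plane 16
    rb")*\Z"
)

def could_be_utf8(data: bytes) -> bool:
    """True if every byte of data belongs to one of the sequences above."""
    return _UTF8_RE.match(data) is not None
```

Note that pure ASCII text also passes, and that control characters other than tab, LF and CR (including byte 0) make it fail, so a true result still only means "could be".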

Cheers
Dave

On 6 Oct 2010, at 21:23, Richard Gaskin wrote:

> I have an app that needs to auto-detect Unicode and plain text, and render 
> them correctly based on that auto-detection.
> 
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is no 
> BOM to let me know if it's Unicode, and some plain text files will 
> occasionally have high-ASCII values in them (like the dagger symbol).
> 
> What patterns should I be looking for in the binary data of a file to 
> distinguish UTF8 from plain text?
> 
> --
> Richard Gaskin
> Fourth World
> LiveCode training and consulting: http://www.fourthworld.com
> Webzine for LiveCode developers: http://www.LiveCodeJournal.com
> LiveCode Journal blog: http://LiveCodejournal.com/blog.irv


Re: Distinguishing between ASCII and UTF8

2010-10-06 Thread Jeff Massung
On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
wrote:

> I have an app that needs to auto-detect Unicode and plain text, and render
> them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is
> no BOM to let me know if it's Unicode, and some plain text files will
> occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to
> distinguish UTF8 from plain text?
>
>
Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
is that, for bytes 0-127, it's indistinguishable from ASCII. You may be able
to scan the files and, if they are large enough, try to deduce something from
them to know which they are. For example:

On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
text file.

In ASCII there will never be a NULL terminator anywhere (byte 0). There are
likely to be many 0-byte values in any appreciably large UTF16 file (UTF8
text, like ASCII, should contain none). This would also be true of byte 8
(backspace) and byte 7 (the bell) and probably a few others.

If the number of bytes that have the high bit (0x80) set is extremely low
(<<< 1%) then most likely it's ASCII.
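
The checks above can be sketched roughly as follows, in Python just for illustration. The function names and the 1% cut-off are my own assumptions (the cut-off is loosely taken from the "<<< 1%" remark), so treat it as a starting point and not a tested detector.

```python
def byte_stats(data: bytes) -> dict:
    """Gather the rough indicators described above for a block of file bytes."""
    total = len(data)
    high = sum(1 for b in data if b & 0x80)   # bytes with the high bit set
    return {
        "has_nul": b"\x00" in data,
        "has_bell_or_backspace": any(b in (7, 8) for b in data),
        "crlf_lines": data.count(b"\r\n"),
        "high_bit_ratio": high / total if total else 0.0,
    }

def looks_like_ascii(data: bytes, threshold: float = 0.01) -> bool:
    """Guess 'ASCII' when there are no NUL bytes and few high-bit bytes.

    The threshold is an assumed value, not something measured.
    """
    stats = byte_stats(data)
    return not stats["has_nul"] and stats["high_bit_ratio"] < threshold
```

A file that fails `looks_like_ascii` is not necessarily UTF8; it only means the bytes are worth a closer look.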

HTH,

Jeff M.


Distinguishing between ASCII and UTF8

2010-10-06 Thread Richard Gaskin
I have an app that needs to auto-detect Unicode and plain text, and 
render them correctly based on that auto-detection.


I have the UTF16 stuff working, but with UTF8 I have a problem:  there 
is no BOM to let me know if it's Unicode, and some plain text files will 
occasionally have high-ASCII values in them (like the dagger symbol).


What patterns should I be looking for in the binary data of a file to 
distinguish UTF8 from plain text?


--
 Richard Gaskin
 Fourth World
 LiveCode training and consulting: http://www.fourthworld.com
 Webzine for LiveCode developers: http://www.LiveCodeJournal.com
 LiveCode Journal blog: http://LiveCodejournal.com/blog.irv