Re: Distinguishing between ASCII and UTF8
On 10/7/10 9:39 PM, Jerry J wrote:
> On Oct 7, 2010, at 11:05 AM, Lynn Fredricks wrote:
>> I still have sweaty nightmares about DOS code pages...
>
> I whisper quietly to myself in a corner: "EBCDIC".
>
> --Jerry Jensen

The thing that wakes me in a cold sweat at the Brahma Muhurta is the FORTRAN "Format".

Richmond
___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
Re: Distinguishing between ASCII and UTF8
On Oct 7, 2010, at 11:05 AM, Lynn Fredricks wrote:
> I still have sweaty nightmares about DOS code pages...

I whisper quietly to myself in a corner: "EBCDIC".

--Jerry Jensen
Re: Distinguishing between ASCII and UTF8
Hi Bob,

UTF8 is platform independent. ASCII is too, but only in the 7-bit range (0-127); the "high ASCII" bytes above that mean different things on different platforms.

--
Economy-x-Talk Consultancy and Software Engineering
http://economy-x-talk.com
http://www.salery.biz

Get your store on-line within minutes with Salery Web Store software. Download at http://www.salery.biz

On 7 Oct 2010, at 18:59, Bob Sneidar wrote:
> Okay, so that begs the question: if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?
>
> Just being practical.
>
> Bob
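To make the platform point concrete, here is a minimal Python sketch (not from the thread) using the dagger character that comes up in the original question. The variable names are made up for illustration; `mac_roman` and `cp1252` are the standard Python codec names for MacRoman and the Windows 1252 code page.

```python
# The dagger (U+2020) is one of the "high-ASCII" characters from the original question.
DAGGER = "\u2020"

# Plain 7-bit text is byte-for-byte identical in ASCII and UTF-8,
# which is why converting it appears to do nothing.
plain = "plain text"
assert plain.encode("ascii") == plain.encode("utf-8")

# Above 127 the legacy single-byte encodings disagree with each other...
mac_bytes = DAGGER.encode("mac_roman")   # classic Mac OS encoding
win_bytes = DAGGER.encode("cp1252")      # Windows Western European code page
# ...while UTF-8 uses the same multi-byte sequence on every platform.
utf8_bytes = DAGGER.encode("utf-8")

print(mac_bytes.hex(), win_bytes.hex(), utf8_bytes.hex())
# prints: a0 86 e280a0
```

So the distinction only matters once a file leaves the 0-127 range: the same dagger is one byte with a different value on each legacy platform, and three bytes in UTF-8.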
RE: Distinguishing between ASCII and UTF8
> On 10/7/10 7:59 PM, Bob Sneidar wrote:
> > Okay, so that begs the question: if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?
> >
> > Just being practical.

UTF8 is (at a minimum) what you want for internationalizing your applications. You can display and manage most of the world's languages with UTF8, though I am more partial to UTF16 because UTF8 has some limitations when it comes to searching/sorting with Chinese characters. Today's operating systems pretty much use UTF16 internally, and text may or may not get converted down to UTF8 along the way.

There used to be ASCII and extended ASCII, though I guess people simply call both ASCII now.

We use UTF16 internally with Valentina, and in cases where the client cannot handle it, it gets transformed so it's useful. Valentina was chosen years ago by Nikon Corporation for Picture Project, a piece of software they shipped worldwide with their digital cameras, because our Unicode support was so good - it made shipping in so many languages easy for them.

I still have sweaty nightmares about DOS code pages...

Best regards,

Lynn Fredricks
President
Paradigma Software
http://www.paradigmasoft.com

Valentina SQL Server: The Ultra-fast, Royalty Free Database Server
Re: Distinguishing between ASCII and UTF8
On 10/7/10 8:02 PM, Bob Sneidar wrote:
> I have a saying: You know exactly as much after you say "Maybe..." as you did before you said it.

I always wonder about the word 'Maybe' and whether it might be almost semantically empty . . . :)

> Bob
>
> On Oct 6, 2010, at 4:55 PM, Richard Gaskin wrote:
>> Jeff, Dave, Peter: thank you!
>>
>> Good stuff - I think I'll be able to distinguish most files using those.
>>
>> --
>> Richard Gaskin
>> Fourth World
Re: Distinguishing between ASCII and UTF8
On 10/7/10 7:59 PM, Bob Sneidar wrote:
> Okay, so that begs the question: if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?
>
> Just being practical.

Some of us grew up in Britain in the 60s and 70s (Oh, how depressing) and remember the feeling of moving from short trousers to long trousers; as far as I understand, ASCII and UTF8 are somehow the same without the place being trashed by the . . . . . (whoops, no politics) . . . those of you who want to understand my reference should watch "Carry On At Your Convenience", a light, easily digestible introduction to the politics of the early 70s.

> Bob
>
> On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:
>> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin wrote:
>>> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
>>>
>>> I have the UTF16 stuff working, but with UTF8 I have a problem: there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
>>>
>>> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?
>>
>> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8 is that, for bytes 0-127, it's indistinguishable from ASCII. You may be able to scan the files and, if they are large enough, try to deduce something from them to know which they are. For example:
>>
>> On Windows, "\r\n" (13, 10) should terminate lines. If it does, it could very well be a text file.
>>
>> In ASCII text there will never be a NULL (byte 0) anywhere, whereas there are likely to be many 0-byte values in any appreciably large UTF16 file (UTF8 never produces a zero byte for an ordinary character). The same goes for byte 8 (backspace), byte 7 (the bell) and probably a few others: plain text shouldn't contain them.
>>
>> If the number of bytes that have the high bit (0x80) set is extremely low (<<< 1%) then most likely it's ASCII.
>>
>> HTH,
>>
>> Jeff M.
Re: Distinguishing between ASCII and UTF8
I have a saying: You know exactly as much after you say "Maybe..." as you did before you said it.

Bob

On Oct 6, 2010, at 4:55 PM, Richard Gaskin wrote:
> Jeff, Dave, Peter: thank you!
>
> Good stuff - I think I'll be able to distinguish most files using those.
>
> --
> Richard Gaskin
> Fourth World
Re: Distinguishing between ASCII and UTF8
Okay, so that begs the question: if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?

Just being practical.

Bob

On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:
> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin wrote:
>> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
>>
>> I have the UTF16 stuff working, but with UTF8 I have a problem: there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
>>
>> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?
>
> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8 is that, for bytes 0-127, it's indistinguishable from ASCII. You may be able to scan the files and, if they are large enough, try to deduce something from them to know which they are. For example:
>
> On Windows, "\r\n" (13, 10) should terminate lines. If it does, it could very well be a text file.
>
> In ASCII text there will never be a NULL (byte 0) anywhere, whereas there are likely to be many 0-byte values in any appreciably large UTF16 file (UTF8 never produces a zero byte for an ordinary character). The same goes for byte 8 (backspace), byte 7 (the bell) and probably a few others: plain text shouldn't contain them.
>
> If the number of bytes that have the high bit (0x80) set is extremely low (<<< 1%) then most likely it's ASCII.
>
> HTH,
>
> Jeff M.
Re: Distinguishing between ASCII and UTF8
Jeff, Dave, Peter: thank you!

Good stuff - I think I'll be able to distinguish most files using those.

--
Richard Gaskin
Fourth World
LiveCode training and consulting: http://www.fourthworld.com
Webzine for LiveCode developers: http://www.LiveCodeJournal.com
LiveCode Journal blog: http://LiveCodejournal.com/blog.irv
Re: Distinguishing between ASCII and UTF8
Richard

> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem: there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?

These are the "Rules of Thumb" that I have used to try to determine the encoding type of text files. I feel that I achieved more than 90 per cent success, but that may be because most of the files only included true ASCII characters (0 - 127).

The script only tries to distinguish between ASCII, UTF-8, MacRoman and Windows 1252 Codepage (the US default for Windows).

Rules of Thumb, applied in the following order:

1. If the string starts with a BOM, the encoding inferred from the BOM will be returned.
2. If the string contains only characters in the range 0x00 - 0x7F, it is an ASCII string.
3. If the string contains more UTF-8 multi-byte characters than it does invalid UTF-8 characters and invalid multi-byte sequences, it is a UTF-8 string.
4. If the string contains characters in the range 0xA0 - 0xFF but none in the range 0x80 - 0x9F, it is an ISO-8859-1 string.
5. If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D, it is a MacRoman string.
6. If the string contains carriage returns but no line feeds, it is a MacRoman string.
7. Otherwise, it is a Windows 1252 Codepage string.

The approach I take in the script is to count the different types of characters in the text and then apply the rules of thumb.

The script is written in REBOL, so it will probably not even be of help as a guide. However, the documentation includes a table of the differences between UTF-8, Windows 1252 and MacRoman which you may find useful.

You can find it at http://www.rebol.org/documentation.r?script=str-enc-utils.r

Regards

Peter
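For readers who do not use REBOL, the rules of thumb above can be sketched in Python roughly as follows. This is an illustrative reading of the seven rules, not a port of Peter's script: the names `guess_encoding`, `BOMS`, `VALID_MULTIBYTE` and `MAC_ONLY` are invented here, only UTF-8 and UTF-16 BOMs are checked, and "invalid" is approximated as the number of high-bit bytes that are not part of a well-formed multi-byte sequence.

```python
import codecs
import re

# Rule 1 helper: byte order marks and the encoding each one implies
# (UTF-32 BOMs are left out of this sketch).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Rule 3 helper: well-formed UTF-8 multi-byte sequences (2, 3 and 4 bytes).
VALID_MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
)

# Rule 5 helper: bytes that are undefined in Windows 1252 but used by MacRoman.
MAC_ONLY = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def guess_encoding(data: bytes) -> str:
    """Apply the seven rules of thumb in order and return a codec name."""
    # Rule 1: a BOM decides immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # Rule 2: nothing above 0x7F means plain ASCII.
    high = [b for b in data if b >= 0x80]
    if not high:
        return "ascii"

    # Rule 3: more valid multi-byte characters than stray high-bit bytes.
    matches = list(VALID_MULTIBYTE.finditer(data))
    valid = len(matches)
    invalid = len(high) - sum(m.end() - m.start() for m in matches)
    if valid > invalid:
        return "utf-8"

    # Rule 4: high bytes present, but none in 0x80 - 0x9F.
    if not any(0x80 <= b <= 0x9F for b in high):
        return "iso-8859-1"

    # Rule 5: bytes that Windows 1252 leaves undefined.
    if MAC_ONLY.intersection(high):
        return "mac_roman"

    # Rule 6: classic Mac line endings (CR only).
    if b"\r" in data and b"\n" not in data:
        return "mac_roman"

    # Rule 7: fall back to Windows 1252.
    return "cp1252"
```

For example, `guess_encoding(b"\x86 note")` falls through to `"cp1252"` (0x86 is the dagger in Windows 1252), while the same dagger encoded as UTF-8 is caught by rule 3.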
Re: Distinguishing between ASCII and UTF8
Richard

Below is a function that was translated from a PHP script. It is intended to determine whether the passed-in string "could be" UTF8. I have tested it in a limited way and it seems to work. But maybe someone else can see the flaws.

If it returns false, then it is not UTF8. If it returns true, it fits the pattern of UTF8, but it could be something else like some random binary.

If it doesn't work, you could perhaps use it to scare children.

function couldBeUtf8 pString
   -- Build a regex that matches only well-formed UTF8 byte sequences
   put "(?is)^([\x09\x0A\x0D\x20-\x7E]" into tRE        -- ASCII: tab, LF, CR, printable
   put "|[\xC2-\xDF][\x80-\xBF]" after tRE               -- non-overlong 2-byte
   put "|\xE0[\xA0-\xBF][\x80-\xBF]" after tRE           -- 3-byte, excluding overlongs
   put "|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" after tRE    -- straight 3-byte
   put "|\xED[\x80-\x9F][\x80-\xBF]" after tRE           -- 3-byte, excluding surrogates
   put "|\xF0[\x90-\xBF][\x80-\xBF]{2}" after tRE        -- 4-byte, planes 1-3
   put "|[\xF1-\xF3][\x80-\xBF]{3}" after tRE            -- 4-byte, planes 4-15
   put "|\xF4[\x80-\x8F][\x80-\xBF]{2})*$" after tRE     -- 4-byte, plane 16
   return matchText(pString, tRE)
end couldBeUtf8

Cheers

Dave

On 6 Oct 2010, at 21:23, Richard Gaskin wrote:
> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem: there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?
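The same "could be UTF8" test can be sketched in Python without a regex, by letting the built-in strict UTF-8 decoder do the validation. This is an assumed equivalent, not a translation anyone in the thread posted: the name `could_be_utf8` and the `ALLOWED_CONTROLS` set are made up here. The extra control-byte check is there because the regex above only admits tab, LF, CR and printable ASCII in the single-byte range, whereas the decoder alone would accept any byte below 0x80.

```python
# Control bytes the regex version accepts: tab, line feed, carriage return.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def could_be_utf8(data: bytes) -> bool:
    """Return False if data cannot be UTF-8 text, True if it fits the pattern."""
    # Mirror the regex's ASCII class: reject other control bytes and DEL (0x7F).
    for b in data:
        if (b < 0x20 and b not in ALLOWED_CONTROLS) or b == 0x7F:
            return False
    # The strict decoder rejects overlong forms, surrogates, stray
    # continuation bytes and code points above U+10FFFF.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(could_be_utf8("dagger \u2020".encode("utf-8")))   # prints: True
print(could_be_utf8("dagger \u2020".encode("cp1252")))  # prints: False
```

As with the LiveCode function, a True result only means the bytes fit the UTF-8 pattern; pure ASCII input also returns True, so this check is best combined with a test for bytes above 0x7F.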
Re: Distinguishing between ASCII and UTF8
On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin wrote:
> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem: there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?

Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8 is that, for bytes 0-127, it's indistinguishable from ASCII. You may be able to scan the files and, if they are large enough, try to deduce something from them to know which they are. For example:

On Windows, "\r\n" (13, 10) should terminate lines. If it does, it could very well be a text file.

In ASCII text there will never be a NULL (byte 0) anywhere, whereas there are likely to be many 0-byte values in any appreciably large UTF16 file (UTF8 never produces a zero byte for an ordinary character). The same goes for byte 8 (backspace), byte 7 (the bell) and probably a few others: plain text shouldn't contain them.

If the number of bytes that have the high bit (0x80) set is extremely low (<<< 1%) then most likely it's ASCII.

HTH,

Jeff M.
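The counts Jeff describes can be gathered in a few lines. The following is a minimal Python sketch, not code from the thread: the function names `byte_stats` and `guess`, the returned labels, and the choice of which control bytes count as suspicious are all assumptions made for illustration.

```python
def byte_stats(data: bytes) -> dict:
    """Collect the simple byte counts that the heuristics rely on."""
    total = len(data)
    high = sum(1 for b in data if b >= 0x80)  # bytes with the high bit set
    nuls = data.count(0)                      # NUL bytes
    # Control bytes that plain text should not contain
    # (tab, LF and CR are treated as normal text here).
    controls = sum(1 for b in data if b < 0x20 and b not in (0x09, 0x0A, 0x0D))
    return {
        "total": total,
        "high_ratio": high / total if total else 0.0,
        "nul_count": nuls,
        "control_count": controls,
        "has_crlf": b"\r\n" in data,
    }


def guess(data: bytes) -> str:
    """Turn the counts into a rough, human-readable guess."""
    stats = byte_stats(data)
    if stats["nul_count"] > 0:
        return "probably UTF-16 or binary"
    if stats["control_count"] > 0:
        return "probably binary"
    if stats["high_ratio"] == 0:
        return "pure ASCII (also valid UTF-8)"
    return "8-bit text: UTF-8 or a legacy code page"
```

A file made only of bytes 0-127 is reported as pure ASCII, which is simultaneously valid UTF8; only files with high-bit bytes need the further checks discussed in the other replies.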
Distinguishing between ASCII and UTF8
I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.

I have the UTF16 stuff working, but with UTF8 I have a problem: there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).

What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?

--
Richard Gaskin
Fourth World
LiveCode training and consulting: http://www.fourthworld.com
Webzine for LiveCode developers: http://www.LiveCodeJournal.com
LiveCode Journal blog: http://LiveCodejournal.com/blog.irv