Richard,

Below is a function translated from a PHP script. It is intended to determine
whether the string passed in "could be" UTF-8. I have tested it in a limited
way and it seems to work, but maybe someone else can spot the flaws.

If it returns false, the string is not valid UTF-8. If it returns true, the
string fits the pattern of UTF-8, but it could still be something else, such
as random binary that happens to match.

If it doesn't work, you could perhaps use it to scare children.
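
function couldBeUtf8 pString
   -- Build the pattern one alternative at a time; each is one legal UTF-8 sequence
   put "(?is)^([\x09\x0A\x0D\x20-\x7E]" into tRE          -- ASCII: tab, LF, CR, printable
   put "|[\xC2-\xDF][\x80-\xBF]" after tRE                -- 2-byte, no overlongs
   put "|\xE0[\xA0-\xBF][\x80-\xBF]" after tRE            -- 3-byte, E0 lead, no overlongs
   put "|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" after tRE     -- 3-byte, other leads
   put "|\xED[\x80-\x9F][\x80-\xBF]" after tRE            -- 3-byte, ED lead, no surrogates
   put "|\xF0[\x90-\xBF][\x80-\xBF]{2}" after tRE         -- 4-byte, planes 1-3
   put "|[\xF1-\xF3][\x80-\xBF]{3}" after tRE             -- 4-byte, planes 4-15
   put "|\xF4[\x80-\x8F][\x80-\xBF]{2})*$" after tRE      -- 4-byte, plane 16
   -- true only if the whole string is made of the sequences above
   return matchText(pString, tRE)
end couldBeUtf8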
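
For what it's worth, here is a rough sketch of how the function might be
called. The handler names guessFileEncoding and checkCouldBeUtf8 are made up
for illustration and are not part of the translated script. The sketch assumes
an engine in which each char of binary data is a single byte, which is what
the pattern relies on; I have not checked it on anything else.

```livecode
-- Hypothetical helper: returns "utf8" or "native" for the file at pPath.
-- Assumes one char = one byte, as the regex in couldBeUtf8 expects.
function guessFileEncoding pPath
   local tData
   -- binfile: reads the bytes untouched, with no line-ending translation
   put URL ("binfile:" & pPath) into tData
   if the result is not empty then return "error:" && the result
   if couldBeUtf8(tData) then
      return "utf8"
   else
      return "native"
   end if
end guessFileEncoding

-- Hypothetical quick check of the two cases described above.
on checkCouldBeUtf8
   local tValid, tInvalid
   -- C3 A9 is e-acute in UTF-8, so this string should fit the pattern
   put couldBeUtf8("caf" & numToChar(195) & numToChar(169)) into tValid
   -- A0 is the dagger in MacRoman; on its own it is a continuation byte
   -- with no lead byte, so this string should not fit the pattern
   put couldBeUtf8("see note" & numToChar(160)) into tInvalid
   put tValid & comma & tInvalid
end checkCouldBeUtf8
```

Note that a file containing only plain ASCII also comes back as "utf8". That
should be harmless, since ASCII text is byte-for-byte the same in UTF-8.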
Cheers
Dave
On 6 Oct 2010, at 21:23, Richard Gaskin wrote:
> I have an app that needs to auto-detect Unicode and plain text, and render
> them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem: there is no
> BOM to let me know if it's Unicode, and some plain text files will
> occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to
> distinguish UTF8 from plain text?
>
> --
> Richard Gaskin
> Fourth World
> LiveCode training and consulting: http://www.fourthworld.com
> Webzine for LiveCode developers: http://www.LiveCodeJournal.com
> LiveCode Journal blog: http://LiveCodejournal.com/blog.irv
> _______________________________________________
> use-revolution mailing list
> [email protected]
> Please visit this url to subscribe, unsubscribe and manage your subscription
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution