On 11/18/05, Trevor DeVore <[EMAIL PROTECTED]> wrote:
> On Nov 17, 2005, at 5:37 PM, Sarah Reichelt wrote:
> > If I UniDecode the text, it comes good except for a weird character at
> > the start which I can handle, but is there a neat way to detect the
> > encoding of text before I start? I suppose I can just look for the
> > word "Subject" and if it isn't there, uniDecode and try again, but it
> > seems there should be a way to detect the encoding of the text itself.
> >
> > Does the weird stuff at the start give me any clues? Checking the
> > ASCII codes, the text starts with ASCII 254, ASCII 255, space and then
> > the first character of my text. Perhaps that's my answer, but will
> > they always be 254 & 255 or does that vary with the encoding?
> >
> > Any ideas?
>
> Hi Sarah,
>
> The "weird stuff" at the beginning is the BOM. This tells
> applications opening the file what kind of UTF file you are dealing
> with. Now, I'm not sure how to decipher each BOM but perhaps Google
> will know the answer.
>
Thanks Trevor, that told me what to look for and provided the answer.

Here is a quote from chapter 15 of the Unicode 4.0 book
<http://www.unicode.org/versions/Unicode4.0.0/>:

"In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file
or stream explicitly signals the byte order. The byte sequence
<FE₁₆ FF₁₆> may serve as a signature to identify a file as containing
Unicode text. This sequence is exceedingly rare at the outset of text
files using other character encodings, whether single- or
multiple-byte, and therefore not likely to be confused with real text
data."

So I think I can be quite safe if I look for numToChar(254) &
numToChar(255) at the start of a file and UniDecode the text if they
are found.

ATB,
Sarah
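
As a rough illustration of the check described above, and of how the leading bytes vary with the encoding, here is a minimal sketch in Python rather than Transcript. The function names `detect_bom` and `decode_text` and the `latin-1` fallback are assumptions made for the example, not anything from the thread. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Known byte-order marks, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),          # bytes 239, 187, 191
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # bytes 254, 255 (the case in this thread)
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # bytes 255, 254
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper: it only inspects the first few bytes.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_text(data, fallback="latin-1"):
    """Decode bytes using the BOM if present, else the assumed fallback."""
    encoding, skip = detect_bom(data)
    if encoding is None:
        # No signature found: treat as plain single-byte text (an assumption).
        return data.decode(fallback)
    # Drop the BOM itself so it does not show up as a stray first character.
    return data[skip:].decode(encoding)
```

So 254, 255 is specific to big-endian UTF-16; little-endian UTF-16 starts with 255, 254, and UTF-8 (when it carries a signature at all) starts with 239, 187, 191. A file with no BOM tells you nothing this way, which is why the sketch needs a fallback.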
