My initial post was about guessing whether the file the user pointed at was likely to be formatted correctly, and I was only looking for plain ASCII. I learned even more than expected!

On Jul 10, 2006, at 11:20 AM, Dar Scott wrote:

On Jul 9, 2006, at 12:59 AM, Scott Morrow wrote:

Does anyone have a method for determining whether a file is plain text that they would be willing to share?

I don't think "plain text or not" is the right question. How sure do you want to be? This can take a lot of processing.

Do you mean plain text vs binary? Plain text vs RTF? Plain text ASCII vs plain text UTF-8?

For example: I have a function I use that tries to "guess" the Unicode encoding form of a file. My approach is not to ask "is this this format?" but "is this more likely this one than the others under consideration?". (That gets hard in some perverse cases of UTF-16BE vs UTF-16LE. Brag: My Unicode recognizer code beats my Microsoft programs in encoding guessing.) I have a few hard rules to handle the easy cases, but for the most part I build up evidence points and then compare.
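To make the "evidence points" idea concrete, here is a minimal sketch in Python. It is not the recognizer described above: the function name guess_encoding, the four candidates, and every weight are assumptions chosen for illustration. Byte order marks are treated as the hard rules; everything else only adds or subtracts points before the candidates are compared.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Guess the most likely of 'ascii', 'utf-8', 'utf-16-le', 'utf-16-be'."""
    # Hard rules for the easy cases: a byte order mark settles it.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if not data:
        return "ascii"

    # Illustrative weights only; a real recognizer would tune these.
    scores = {"ascii": 0, "utf-8": 0, "utf-16-le": 0, "utf-16-be": 0}

    # Evidence: every byte is 7-bit and none is NUL.
    if all(0 < b < 0x80 for b in data):
        scores["ascii"] += 3

    # Evidence: strict UTF-8 decoding succeeds; multi-byte sequences count extra.
    try:
        data.decode("utf-8")
        scores["utf-8"] += 2
        if any(b >= 0x80 for b in data):
            scores["utf-8"] += 2
    except UnicodeDecodeError:
        scores["utf-8"] -= 5

    # Evidence: for mostly Latin text in UTF-16, the zero high bytes sit at
    # odd offsets (little-endian) or even offsets (big-endian).
    even_nuls = data[0::2].count(0)
    odd_nuls = data[1::2].count(0)
    if odd_nuls > even_nuls:
        scores["utf-16-le"] += 4
    elif even_nuls > odd_nuls:
        scores["utf-16-be"] += 4

    # Compare the candidates; ties go to the earliest entry in the dict.
    return max(scores, key=scores.get)
```

The answer is always "the best of the candidates considered", so a file in some other encoding will still be assigned one of these four labels.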

Also, I don't look at the whole file (except in some special cases). I look at only the characters near the end and near the front. That puts an upper bound on determination time.
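A rough sketch of that sampling idea in Python follows. The name sample_ends and the 4096-byte default are assumptions, not anything from the original code. It reads a fixed-size chunk from the front and another from the end, so the work done does not grow with the size of the file.

```python
import os

def sample_ends(path, chunk=4096):
    """Return (head, tail): about `chunk` bytes from each end of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(chunk)
        if size <= chunk:
            # Small file: the head already covers everything.
            return head, b""
        # Start the tail no earlier than the end of the head.
        tail_start = max(chunk, size - chunk)
        # Keep the offset even so 16-bit code units stay aligned.
        tail_start -= tail_start % 2
        f.seek(tail_start)
        tail = f.read()
    return head, tail
```

The two samples can then be handed to whatever check is being used, in place of the whole file.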


Is the question "Should I dump this into a field or should I convert to hex first?"

Dar Scott
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
