On Dec 11, 2006, at 3:09 PM, Chris Sheffield wrote:

> Does anyone have a sure fire way to determine if a file is binary or text?
>
> I have need to create an import utility that will import data from a text file (csv, tab-delimited, etc) into a database, but I'd like to check the file before doing anything else just to make sure it is in fact text and not binary.

In general, there is no way.

However, all is not lost.

A text file is a special case of a binary file: its bytes happen to encode a sequence of characters.

For very short files, it is hard to tell. However, if you have some idea of the pattern you are expecting, you can increase your confidence that a given file is binary or text.

Many file formats have magic words and header data that indicate the type. These provide a hint, and an additional check can add some confidence. For example, a magic word plus a required element can identify a .png file: check whether the file starts with format("\211PNG\r\n\032\n\000\000\000\015IHDR"), that is, the 8-byte PNG signature followed by the length and name of the required IHDR chunk.
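
The PNG check described above can be sketched in Python (not Revolution, so treat it as an illustration only). The names PNG_HEADER, looks_like_png and file_looks_like_png are made up for this example; the byte string is the same one as the format() expression, written with hex escapes.

```python
# The 8-byte PNG signature, then the length (13) and type of the
# mandatory first chunk, IHDR. Hypothetical constant name.
PNG_HEADER = b"\x89PNG\r\n\x1a\n\x00\x00\x00\x0dIHDR"


def looks_like_png(data: bytes) -> bool:
    """Return True if data begins with the PNG signature and IHDR chunk header."""
    return data.startswith(PNG_HEADER)


def file_looks_like_png(path: str) -> bool:
    """Read only as many bytes as the header needs and test them."""
    with open(path, "rb") as f:
        return looks_like_png(f.read(len(PNG_HEADER)))
```

A match only says the file claims to be a PNG. A mismatch says nothing about whether the file is text.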

Unicode files often have a BOM (byte order mark) at the start, but it is not required in some cases and shouldn't be there in others. I have a function I use to differentiate among Unicode files, but it assumes I already know they are Unicode, and even then it has trouble with some perverse files. (It does get it right more often than Microsoft programs do.) UTF-8 also restricts which byte sequences are valid, so that can help.
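
As a rough illustration of those two hints, here is a Python sketch. It is not the function I mentioned; sniff_bom and is_valid_utf8 are hypothetical names, and the BOM constants come from Python's standard codecs module. A missing BOM proves nothing, and a file that passes the UTF-8 check may still be something else.

```python
import codecs

# Longer BOMs are listed first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Pass is_valid_utf8 the whole file rather than a prefix, since a prefix can end in the middle of a multi-byte character and fail for that reason alone.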

Text files should have certain patterns. For example, if the file is ASCII and is comma-delimited or tab-delimited, there are some indicators. You should see only certain control characters. You should see the expected delimiter. You should see either CR or LF or both. All characters should have codes less than 128. You might want to require the same number of delimiters per line.
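
Those indicators might be combined along these lines. This is a Python sketch under my own assumptions: looks_like_delimited_ascii and ALLOWED_CONTROLS are hypothetical names, blank lines are ignored, and delimiters are counted naively, so a quoted field that contains the delimiter will throw the count off.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, LF, CR


def looks_like_delimited_ascii(data: bytes, delimiter: bytes = b",") -> bool:
    """Heuristic check that data is ASCII text with a consistent delimiter count."""
    if not data:
        return False
    for byte in data:
        # Reject 8-bit codes and DEL.
        if byte >= 127:
            return False
        # Reject control characters other than tab, LF and CR.
        if byte < 32 and byte not in ALLOWED_CONTROLS:
            return False
    # bytes.splitlines() splits on CR, LF and CRLF; skip empty lines.
    lines = [line for line in data.splitlines() if line]
    if not lines:
        return False
    # Every line must contain the same, non-zero number of delimiters.
    counts = {line.count(delimiter) for line in lines}
    return len(counts) == 1 and counts.pop() > 0
```

For a tab-delimited file, pass delimiter=b"\t". How strict to be (for example, whether a ragged last line is acceptable) depends on the data you expect.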

So, given some specified pattern of what you expect in binary or text, you should be able to differentiate.

However, an alternative approach would be to simply parse the file as the format you expect and, if it does not pass, reject it, whatever form the data turns out to have.
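
A minimal sketch of that parse-and-reject idea, again in Python and using its standard csv module. The function name parse_or_reject and the rule that every row must have the same number of fields are my assumptions for the example, not requirements of any format.

```python
import csv
import io


def parse_or_reject(data: bytes, delimiter: str = ",", encoding: str = "ascii"):
    """Return the parsed rows, or None if the data does not parse as expected."""
    try:
        text = data.decode(encoding)
        # newline="" lets the csv module handle CR, LF and CRLF itself.
        reader = csv.reader(io.StringIO(text, newline=""),
                            delimiter=delimiter, strict=True)
        rows = [row for row in reader if row]  # ignore blank lines
    except (UnicodeDecodeError, csv.Error):
        return None
    # Assumed rule: at least one row, and all rows the same width.
    if not rows or len({len(row) for row in rows}) != 1:
        return None
    return rows
```

With this approach the importer never asks whether the file is "binary"; it only asks whether the file is something it can import.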

Dar
