On Dec 11, 2006, at 3:09 PM, Chris Sheffield wrote:
Does anyone have a sure fire way to determine if a file is binary
or text?
I have need to create an import utility that will import data from
a text file (csv, tab-delimited, etc) into a database, but I'd like
to check the file before doing anything else just to make sure it
is in fact text and not binary.
In general, there is no way.
However, all is not lost.
A text file is a special case of a binary file: its bytes are the
encoded representations of a sequence of characters.
For very short files, it is hard to tell. However, if you have some
idea of the pattern you are expecting you can increase your
confidence that some file is binary or text.
Many file formats have magic words and header data that indicate the
type. These provide a hint and an additional check can provide some
confidence. For example, a magic word plus a required element can
identify a .png file: check whether the file starts with
format("\211PNG\r\n\032\n\000\000\000\015IHDR").
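The same check can be sketched in Python, for readers outside
Revolution. The name looks_like_png is made up for illustration; the
bytes are the ones in the string above (the octal escapes \211, \032
and \015 are 0x89, 0x1A and 0x0D, the last being the IHDR length of 13).

```python
# PNG signature (8 bytes) followed by the length and type of the
# required first chunk, IHDR. Same bytes as the octal string above.
PNG_PREFIX = b"\x89PNG\r\n\x1a\n\x00\x00\x00\x0dIHDR"


def looks_like_png(data: bytes) -> bool:
    """Return True if data starts with the PNG signature plus IHDR header."""
    return data.startswith(PNG_PREFIX)
```

Reading the first 16 bytes of the file is enough for this test. A
match is strong evidence of a PNG; a mismatch only says the file is not
a PNG and nothing about whether it is text.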
Unicode files often have a BOM (byte order mark) at the start, but it
is not required in some cases and shouldn't be there in others. I
have a function I use to differentiate among Unicode files, but that
already assumes I know they are Unicode, and even then it has trouble
with some perverse files. (It does get it right more often than
Microsoft programs do.) UTF-8 also restricts which byte sequences
are valid, so that can help.
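A rough Python sketch of both hints follows. This is not the function
I mentioned, only an illustration; sniff_bom and is_valid_utf8 are
made-up names. It assumes the whole file, or at least its start, has
been read as bytes.

```python
import codecs

# Order matters: the UTF-32 LE BOM starts with the same two bytes as
# the UTF-16 LE BOM, so the longer marks are tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A missing BOM proves nothing, and a UTF-16 LE file whose first
character is NUL would be taken for UTF-32 LE here, which is one of the
perverse cases. Passing the UTF-8 test is a good sign for longer files,
since random binary data rarely happens to be valid UTF-8.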
Text files should have certain patterns. For example, if the file is
ASCII and is comma-delimited or tab-delimited, there are some
indicators. You should see only certain control characters. You
should see the expected delimiter. You should see either CR or LF or
both. All characters should have codes less than 128. You might want to
require the same number of delimiters per line.
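Those indicators can be combined into one test. Here is a minimal
Python sketch, with assumptions of my own choosing: the function name
is made up, the only control characters allowed are tab, CR and LF, and
delimiters are counted naively, so a quoted field that contains the
delimiter would make a good file fail.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, LF, CR


def looks_like_delimited_ascii(data: bytes, delimiter: bytes = b",") -> bool:
    """Heuristic: ASCII only, few control characters, regular delimiters."""
    if not data:
        return False
    for byte in data:
        if byte >= 127:
            # Codes of 128 and above are not ASCII; 127 is the DEL control.
            return False
        if byte < 32 and byte not in ALLOWED_CONTROLS:
            return False
    # splitlines() on bytes breaks at CR, LF, or CR LF; blank lines are skipped.
    counts = {line.count(delimiter) for line in data.splitlines() if line}
    # Every non-blank line must hold the same, nonzero, number of delimiters.
    return len(counts) == 1 and counts.pop() > 0
```

For a tab-delimited file, pass delimiter=b"\t". Loosen or tighten the
rules to match the pattern you expect.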
So, given some specified pattern of what you expect in binary or
text, you should be able to differentiate.
However, an alternate approach would be to parse the file and if the
file does not pass, then reject it no matter the form of the data.
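A sketch of that approach in Python, using the standard csv module.
The name parse_or_reject and the rule that every row must have the same
number of fields are assumptions for illustration; your importer would
apply whatever rules the database needs.

```python
import csv
import io


def parse_or_reject(data: bytes, delimiter: str = ",", expected_fields=None):
    """Return the parsed rows, or None if the data does not parse as expected."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return None  # Bytes of 128 or above: not the ASCII text expected.
    if "\x00" in text:
        return None  # NUL is valid ASCII but has no place in a text import.
    try:
        reader = csv.reader(
            io.StringIO(text, newline=""), delimiter=delimiter, strict=True
        )
        rows = [row for row in reader if row]  # Skip blank lines.
    except csv.Error:
        return None  # Malformed quoting and similar problems.
    if not rows:
        return None
    width = expected_fields if expected_fields is not None else len(rows[0])
    if any(len(row) != width for row in rows):
        return None
    return rows
```

The caller imports only when the result is not None, so a binary file
and a malformed text file are rejected the same way.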
Dar