Hi,
I want to share some thoughts and make some suggestions about the SQLite
3 command line interface (CLI) tool, especially its behaviour when
importing CSV files.
CSV files are probably even more common for data exchange than SQLite
database files. I consider it to be good practice to include a header
line with column names in every CSV file. Metadata should go with the
data. This prevents me from mixing up two columns that contain numbers
with similar value distributions.
Let’s look at an example. A file named data.csv contains three lines:
id,val1,val2
A,27,8
B,3,12
Now …
sqlite3
sqlite> .import data.csv tab
works and looks good at first, but there is a problem with numerical data.
sqlite> SELECT * FROM tab ORDER BY val2;
B,3,12
A,27,8
This is because on import all three columns were created with affinity
(not to say data type) TEXT (see .schema). As a consequence all numbers
were imported as strings.
'12' < '8' holds lexicographically, but not numerically.
Having the CSV file in mind I clearly expect to see 8 on the first line
of the above result.
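One way to see this is to ask for the stored types; assuming the import from
above, I would expect something like:
sqlite> SELECT id, typeof(val1), typeof(val2) FROM tab;
A,text,text
B,text,text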
How to work around this? Just define the table in advance with
appropriate data types (affinity INTEGER):
sqlite> CREATE TABLE tab (id TEXT, val1 INT, val2 INT);
But now the above .import command will not work as expected, because it
will result in three rows in our table tab. The first row contains the
header line.
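For illustration, a plain SELECT would then show the header line as an
ordinary row, roughly like this:
sqlite> SELECT * FROM tab;
id,val1,val2
A,27,8
B,3,12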
Two different workarounds come to my mind:
a) sqlite> DELETE FROM tab WHERE rowid = 1; -- Dirty hack!
b) sqlite> .import data.csv temp_tab
sqlite> INSERT INTO tab SELECT * FROM temp_tab;
sqlite> DROP TABLE temp_tab; -- Two tedious extra lines
Neither approach is very appealing to me. To make CSV files with a header
line first-class citizens, I suggest this instead:
sqlite> .import -h data.csv tab
should ignore the first line of data.csv. This import should fail if
table tab does not already exist.
This is both shorter and more elegant than either of the workarounds above.
Now on to a second issue. Let’s assume you have successfully imported a
file containing these four lines:
id,val1,val2
A,27,8
B,3,12
C,,1
into the table tab mentioned above, resulting in three rows. Notice the
missing value in column val1 on the last line. This missing value is
imported as an empty string '' regardless of the affinity of column val1.
That leads to (mathematically) unexpected results from aggregate functions:
sqlite> SELECT SUM(val1) FROM tab;
30 -- Well, treating '' like 0 is OK in this case
sqlite> SELECT COUNT(val1) FROM tab;
3 -- but not so here; only two values/numbers were given in data.csv
sqlite> SELECT AVG(val1) FROM tab;
10 -- but the average of 3 and 27 is 15
sqlite> SELECT MAX(val1) FROM tab;
'' -- not to be expected when looking at data.csv
This happens because SQLite orders TEXT values after all numeric values,
so MAX considers '' greater than 27.
OK, I hear you say, what about this workaround:
sqlite> UPDATE tab SET val1 = NULL WHERE val1 = '';
This makes the surprises above go away, but it is again tedious to do for
every column that is supposed to contain only numbers.
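A slightly shorter spelling of the same workaround uses NULLIF; this is just
a sketch for the two value columns of our example, and it still has to name
every numeric column explicitly:
sqlite> UPDATE tab SET val1 = NULLIF(val1, ''), val2 = NULLIF(val2, '');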
My suggestion is: If someone goes the extra mile and defines a table in
advance in order to have an appropriate numerical affinity (INTEGER, REAL
or NUMERIC) for a column, then it is OK to encode a missing value as NULL
instead of ''. It seems right though to keep the current behaviour for
columns with affinity TEXT (the default) or BLOB.
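If missing values were imported as NULL, the aggregate queries from above
would behave the way SQL normally treats NULL, i.e. I would expect something
like:
sqlite> SELECT SUM(val1), COUNT(val1), AVG(val1), MAX(val1) FROM tab;
30,2,15.0,27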
To sum things up:
1. There should be no penalty for using header lines in CSV files. Thus a
new flag -h for .import would be much appreciated.
2. Missing values in columns with numerical affinity should show up as
NULL values.
Thanks for reading; I look forward to your opinions on these issues.
Greetings
Christof