Hi,

I want to share some thoughts and make some suggestions about the SQLite 3 command line interface (CLI) tool, especially its behaviour when importing CSV files.

CSV files are probably even more common for data exchange than SQLite database files. I consider it good practice to include a header line with column names in every CSV file: metadata should travel with the data. This prevents me from mixing up two columns that contain numbers with similar value distributions.

Let’s look at an example. A file named data.csv contains three lines:

id,val1,val2
A,27,8
B,3,12

Now …
sqlite3
sqlite> .import data.csv tab

works and looks good at first, but there is a problem with numerical data.

sqlite> SELECT * FROM tab ORDER BY val2;
B,3,12
A,27,8

This is because on import all three columns were created with affinity (not to say data type) TEXT (see .schema). As a consequence all numbers were imported as strings. '12' < '8' is correct lexicographically, but not mathematically. With the CSV file in mind, I clearly expect to see 8 on the first line of the above result.
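The effect is easy to reproduce outside the CLI. Here is a minimal sketch using Python's standard sqlite3 module, mimicking the all-TEXT table that .import creates when the table does not exist yet:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Mimic what .import does when it has to create the table itself:
# every column gets affinity TEXT, so the numbers arrive as strings.
con.execute("CREATE TABLE tab (id TEXT, val1 TEXT, val2 TEXT)")
con.executemany("INSERT INTO tab VALUES (?, ?, ?)",
                [("A", "27", "8"), ("B", "3", "12")])

rows = con.execute("SELECT * FROM tab ORDER BY val2").fetchall()
print(rows)  # [('B', '3', '12'), ('A', '27', '8')] -- '12' sorts before '8'
```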

How to work around this? Just define the table in advance with appropriate data types (affinity INTEGER):

sqlite> CREATE TABLE tab (id TEXT, val1 INT, val2 INT);

But now the above .import command will not work as expected: the header line is imported as an ordinary data row, so table tab ends up with three rows instead of two.

Two different workarounds come to my mind:
a) sqlite> DELETE FROM tab WHERE rowid = 1;  -- Dirty hack!
b) sqlite> .import data.csv temp_tab
   sqlite> INSERT INTO tab SELECT * FROM temp_tab;
   sqlite> DROP TABLE temp_tab;  -- Two tedious extra lines
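For illustration, workaround b) can be sketched with Python's sqlite3 module. Here temp_tab stands in for the all-TEXT table that .import creates from the header line (column names taken from the header, data rows only); copying through the typed table lets the INT affinity of tab coerce the strings back to numbers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id TEXT, val1 INT, val2 INT)")

# temp_tab plays the role of the table .import creates on the fly:
# all columns TEXT, header consumed as column names.
con.execute("CREATE TABLE temp_tab (id TEXT, val1 TEXT, val2 TEXT)")
con.executemany("INSERT INTO temp_tab VALUES (?, ?, ?)",
                [("A", "27", "8"), ("B", "3", "12")])

# Copying into the typed table converts '27' -> 27 etc. via INT affinity.
con.execute("INSERT INTO tab SELECT * FROM temp_tab")
con.execute("DROP TABLE temp_tab")

rows = con.execute("SELECT * FROM tab ORDER BY val2").fetchall()
print(rows)  # [('A', 27, 8), ('B', 3, 12)] -- numeric order restored
```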

Neither approach is very appealing to me. To make CSV files with a header line first-class citizens, I suggest this instead:
sqlite> .import -h data.csv tab
should ignore the first line of data.csv. This import should fail if table tab does not already exist.
This is both shorter and more elegant than either workaround.


Now on to a second issue. Let’s assume you have successfully imported a file containing these four lines:
id,val1,val2
A,27,8
B,3,12
C,,1
into the table tab mentioned above, resulting in three rows. Notice the missing value in column val1 on the last line. This missing value is imported as an empty string '' regardless of the affinity of column val1.

That leads to (mathematically) unexpected results from aggregate functions:
sqlite> SELECT SUM(val1) FROM tab;
30    -- Well, treating '' like 0 is OK in this case
sqlite> SELECT COUNT(val1) FROM tab;
3     -- but not so here; only two values/numbers were given in data.csv
sqlite> SELECT AVG(val1) FROM tab;
10    -- wrong; the average of 3 and 27 is 15
sqlite> SELECT MAX(val1) FROM tab;
''    -- not to be expected when looking at data.csv; SQLite sorts text above any number
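The same surprises can be reproduced in Python's sqlite3 module; the '' cannot be coerced to a number, so it is stored as text despite the column's INT affinity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id TEXT, val1 INT, val2 INT)")
# '' does not look like a number, so INT affinity leaves it
# as an empty text value in val1.
con.executemany("INSERT INTO tab VALUES (?, ?, ?)",
                [("A", 27, 8), ("B", 3, 12), ("C", "", 1)])

print(con.execute("SELECT SUM(val1) FROM tab").fetchone()[0])    # 30
print(con.execute("SELECT COUNT(val1) FROM tab").fetchone()[0])  # 3
print(con.execute("SELECT AVG(val1) FROM tab").fetchone()[0])    # 10.0
print(con.execute("SELECT MAX(val1) FROM tab").fetchone()[0])    # ''
```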

OK, I hear you say, what about this workaround:
sqlite> UPDATE tab SET val1 = NULL WHERE val1 = '';
This makes the surprises above go away, but repeating it for every numeric column is again tedious.
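To see the fix in action, here is the same table in Python's sqlite3 module after applying the UPDATE; the aggregates now ignore the missing value, as NULL semantics intend:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id TEXT, val1 INT, val2 INT)")
con.executemany("INSERT INTO tab VALUES (?, ?, ?)",
                [("A", 27, 8), ("B", 3, 12), ("C", "", 1)])

# The workaround: replace empty strings with NULL so aggregates skip them.
con.execute("UPDATE tab SET val1 = NULL WHERE val1 = ''")

print(con.execute("SELECT COUNT(val1) FROM tab").fetchone()[0])  # 2
print(con.execute("SELECT AVG(val1) FROM tab").fetchone()[0])    # 15.0
print(con.execute("SELECT MAX(val1) FROM tab").fetchone()[0])    # 27
```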

My suggestion is: if someone goes the extra mile and defines a table in advance in order to give a column an appropriate numerical affinity (INTEGER, REAL or NUMERIC), then it is OK to encode a missing value as NULL instead of ''. It seems right, though, to keep the current behaviour for columns with affinity TEXT (the default) or BLOB.

To sum things up:
1. There should be no penalty for using header lines in CSV files; a new flag -h for .import would be much appreciated.
2. Missing values in columns with numerical affinity should show up as NULL values.

Thanks for reading. I look forward to your opinions on these issues.

Greetings
Christof
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
