Hi,
I want to share some thoughts and make some suggestions about the SQLite
3 command line interface (CLI) tool, especially its behaviour when
importing CSV files.
CSV files are probably even more common for data exchange than SQLite
database files. I consider it to be good practice to include a header
line with column names in every CSV file. Metadata should go with the
data. This prevents me from mixing up two columns that contain numbers
with similar value distributions.
Let’s look at an example. A file named data.csv contains three lines:
id,val1,val2
A,27,8
B,3,12
Now …
sqlite3
sqlite> .import data.csv tab
works and looks good at first, but there is a problem with numerical data.
sqlite> SELECT * FROM tab ORDER BY val2;
B,3,12
A,27,8
This is because on import all three columns were created with affinity
(not to say data type) TEXT (see .schema). As a consequence all numbers
were imported as strings.
'12' < '8' holds lexicographically, but not numerically.
Having the CSV file in mind I clearly expect to see 8 on the first line
of the above result.
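One way to see this is to ask for the stored types; assuming the import from
above, I would expect something like:
sqlite> SELECT id, typeof(val1), typeof(val2) FROM tab;
A,text,text
B,text,text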
How to work around this? Just define the table in advance with
appropriate data types (affinity INTEGER):
sqlite> CREATE TABLE tab (id TEXT, val1 INT, val2 INT);
But now the above .import command will not work as expected, because it
will result in three rows in our table tab. The first row contains the
header line.
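For illustration, a plain SELECT would then show the header line as an
ordinary row, roughly like this:
sqlite> SELECT * FROM tab;
id,val1,val2
A,27,8
B,3,12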
Two different workarounds come to my mind:
a) sqlite> DELETE FROM tab WHERE rowid = 1; -- Dirty hack!
b) sqlite> .import data.csv temp_tab
sqlite> INSERT INTO tab SELECT * FROM temp_tab;
sqlite> DROP TABLE temp_tab; -- Two tedious extra lines
Neither approach is very appealing to me. To make CSV files with a header
line first-class citizens, I suggest this instead:
sqlite> .import -h data.csv tab
should ignore the first line of data.csv. This import should fail if
table tab does not already exist.
This is both shorter and more elegant than either of the workarounds above.
Now on to a second issue. Let’s assume you have successfully imported a
file containing these four lines:
id,val1,val2
A,27,8
B,3,12
C,,1
into the table tab mentioned above, resulting in three rows. Notice the
missing value in column val1 on the last line. This missing value is
imported as an empty string '' regardless of the affinity of column val1.
That leads to (mathematically) unexpected results from aggregate functions:
sqlite> SELECT SUM(val1) FROM tab;
30 -- Well, treating '' like 0 is OK in this case
sqlite> SELECT COUNT(val1) FROM tab;
3 -- but not so here; only two values/numbers were given in data.csv
sqlite> SELECT AVG(val1) FROM tab;
10 -- but the average of 3 and 27 is 15
sqlite> SELECT MAX(val1) FROM tab;
'' -- not to be expected when looking at data.csv
This happens because SQLite orders TEXT values after all numeric values,
so MAX considers '' greater than 27.
OK, I hear you say, what about this workaround:
sqlite> UPDATE tab SET val1 = NULL WHERE val1 = '';
This makes the surprises above go away, but it is again tedious to do for
every column that is supposed to contain only numbers.
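A slightly shorter spelling of the same workaround uses NULLIF; this is just
a sketch for the two value columns of our example, and it still has to name
every numeric column explicitly:
sqlite> UPDATE tab SET val1 = NULLIF(val1, ''), val2 = NULLIF(val2, '');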
My suggestion is: If someone goes the extra mile and defines a table in
advance in order to have an appropriate numerical affinity (INTEGER, REAL
or NUMERIC) for a column, then it is OK to encode a missing value as NULL
instead of ''. It seems right though to keep the current behaviour for
columns with affinity TEXT (the default) or BLOB.
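If missing values were imported as NULL, the aggregate queries from above
would behave the way SQL normally treats NULL, i.e. I would expect something
like:
sqlite> SELECT SUM(val1), COUNT(val1), AVG(val1), MAX(val1) FROM tab;
30,2,15.0,27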
To sum things up:
1. There should be no penalty for using header lines in CSV files. Thus a
new flag -h for .import would be much appreciated.
2. Missing values in columns with numerical affinity should show up as
NULL values.
Thanks for reading; I look forward to your opinions on these issues.
Greetings
Christof