Re: [sqlite] CSV import using CLI (header, NULL)
On Feb 27, 2020, at 11:51 PM, Christof Böckler wrote:

> 1. There should be no penalty for using header lines in CSV files. Thus a
> new flag -h for .import is much appreciated.

More than that, SQLite should be able to *actively use* the header when present. For instance, given:

  foo,bar,qux
  1,2,3

there should be a way to import only the first and third columns, perhaps with a command like so:

  .import data.csv tablename foo,qux

I’ve needed such a thing multiple times. One case that comes to mind is that tablename has only two columns, and you’re purposefully subsetting the CSV data in SQLite. This is common when importing DBs from outside sources, where the other DB has more columns than your local SQLite DB. It’s annoying to have to preprocess the data to strip the “extra” columns out. It’s even more annoying to subset it in SQLite by mirroring the outside DB’s structure, then applying the common workaround for the lack of ALTER TABLE DROP COLUMN.

Another case I’ve run into is that the table layout of the CSV matches the one in SQLite, but you want one or more of the columns to get their default values for some reason, rather than take them from the CSV. You see this with time-varying data when the CSV is exported from SQLite at time t0 and reimported at t1 after some of the data has gone stale, so that letting the stale columns revert to their defaults is better than importing obsolete data.

To make that second example concrete, imagine an HVAC monitoring system’s DB: the names of the stations and the wear data are still useful, but the last known temperature shouldn’t be imported, because the old data could be entirely inaccurate, and fresh data should be coming along soon after the DB table is restored from the backup. Meanwhile, the system can report “I don’t know” for the temperature measurements.

> 2. Missing values in columns with numerical affinity should show up as NULL
> values.
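For what it’s worth, the column-subset import can be approximated today without preprocessing the file, by picking columns before inserting. A rough sketch using Python’s sqlite3 module (the table name, column names, and sample data are all made up for illustration):

```python
import csv
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical two-column target table; the CSV carries an extra 'bar' column.
con.execute("CREATE TABLE tablename (foo INT, qux INT)")

# Stand-in for data.csv: csv.DictReader accepts any iterable of lines.
lines = ["foo,bar,qux", "1,2,3", "4,5,6"]

wanted = ["foo", "qux"]  # the columns the proposed .import syntax would name
con.executemany(
    "INSERT INTO tablename (foo, qux) VALUES (?, ?)",
    ([row[c] for c in wanted] for row in csv.DictReader(lines)),
)
# INTEGER affinity coerces the well-formed numeric text to integers on insert.
print(con.execute("SELECT * FROM tablename ORDER BY foo").fetchall())
```

The point being: the header line does the column matching, which is exactly what the proposed flag would let the CLI do natively.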
I don’t think one hard-and-fast rule works for everyone here, if for no other reason than that there must be people depending on the current behavior. Instead, let it be a mode:

  .blank default
  .blank blank
  .blank zero
  .blank null

This proposed setting would tell the CSV importer how to handle blanks: use the DB column’s default value, leave them be, make them zeroes, or make them SQL NULLs. The default value for this setting would be “blank”, for backwards compatibility.

Parallel .zero and .null settings could be added to take care of similar cases. Their defaults would naturally be “zero” and “null”, respectively.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
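To spell out what each proposed mode would amount to, here is a sketch in Python’s sqlite3 module (the table, the DEFAULT value, and the sample rows are invented for illustration; today each of these is a manual post-import UPDATE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table with a declared DEFAULT, to illustrate ".blank default".
con.execute("CREATE TABLE tab (id TEXT, val1 INT DEFAULT 99)")
# As imported today (".blank blank"): the missing value arrives as ''.
con.executemany("INSERT INTO tab VALUES (?, ?)", [("A", "27"), ("C", "")])

# ".blank null" would amount to this, per numeric column:
con.execute("UPDATE tab SET val1 = NULL WHERE val1 = ''")
# ".blank zero" would instead be:    UPDATE tab SET val1 = 0  WHERE val1 = ''
# ".blank default" would re-apply the column's declared DEFAULT (99 here).
print(con.execute("SELECT id, val1 FROM tab ORDER BY id").fetchall())
```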
Re: [sqlite] CSV import using CLI (header, NULL)
On Sat, Feb 29, 2020 at 1:42 PM Shawn Wagner wrote:

> To import a csv file with headers into an existing table, you can use
>
>   .import '| tail -n +2 yourfile.csv' yourtable
>
> to skip the header line.

On Unix, that is. And by shelling out to native tools, so it isn’t portable. The CLI ought to have something built in, if it doesn’t already.
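A portable equivalent is easy to script outside the CLI. A minimal sketch using Python’s sqlite3 module (the helper name and sample data are made up; a real version would read the file and handle quoting edge cases via the csv module, as shown):

```python
import csv
import sqlite3

def import_csv_skip_header(con, lines, table):
    """Hypothetical helper: insert CSV rows into an existing table,
    dropping the header line, with no shell tools involved."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    sql = None
    for row in reader:
        if sql is None:
            placeholders = ",".join("?" * len(row))
            sql = "INSERT INTO %s VALUES (%s)" % (table, placeholders)
        con.execute(sql, row)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id TEXT, val1 INT, val2 INT)")
import_csv_skip_header(con, ["id,val1,val2", "A,27,8", "B,3,12"], "tab")
# With INT affinity declared, val2 now sorts numerically (8 before 12).
print(con.execute("SELECT id FROM tab ORDER BY val2").fetchall())
```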
Re: [sqlite] CSV import using CLI (header, NULL)
To import a csv file with headers into an existing table, you can use

  .import '| tail -n +2 yourfile.csv' yourtable

to skip the header line.

On Sat, Feb 29, 2020, 4:30 AM Christof Böckler wrote:

> Hi,
>
> I want to share some thoughts and make some suggestions about the SQLite
> 3 command line interface (CLI) tool, especially its behaviour when
> importing CSV files.
>
> […]
>
> To sum things up:
> 1. There should be no penalty for using header lines in CSV files. Thus a
> new flag -h for .import is much appreciated.
> 2. Missing values in columns with numerical affinity should show up as
> NULL values.
>
> Thanks for reading, I look forward to your opinions about these issues.
>
> Greetings
> Christof
[sqlite] CSV import using CLI (header, NULL)
Hi,

I want to share some thoughts and make some suggestions about the SQLite 3 command line interface (CLI) tool, especially its behaviour when importing CSV files.

CSV files are probably even more common for data exchange than SQLite database files. I consider it good practice to include a header line with column names in every CSV file. Metadata should go with the data. This keeps me from mixing up two columns that contain numbers with similar value distributions.

Let’s look at an example. A file named data.csv contains three lines:

  id,val1,val2
  A,27,8
  B,3,12

Now …

  sqlite3
  sqlite> .import data.csv tab

works and looks good at first, but there is a problem with numerical data.

  sqlite> SELECT * FROM tab ORDER BY val2;
  B,3,12
  A,27,8

This is because on import all three columns were created with affinity (not to say data type) TEXT (see .schema). As a consequence all numbers were imported as strings. '12' < '8' is fine lexicographically, but not in a mathematical sense. With the CSV file in mind, I clearly expect to see 8 on the first line of the above result.

How to work around this? Just define the table in advance with appropriate data types (affinity INTEGER):

  sqlite> CREATE TABLE tab (id TEXT, val1 INT, val2 INT);

But now the above .import command will not work as expected, because it will result in three rows in our table tab. The first row contains the header line.

Two different workarounds come to mind:

  a) sqlite> DELETE FROM tab WHERE rowid = 1; -- Dirty hack!
  b) sqlite> .import data.csv temp_tab
     sqlite> INSERT INTO tab SELECT * FROM temp_tab;
     sqlite> DROP TABLE temp_tab; -- Two tedious extra lines

Neither approach is very appealing to me. To make CSV files with a header line first-class citizens, I suggest this instead:

  sqlite> .import -h data.csv tab

should ignore the first line of data.csv. This import should fail if table tab does not already exist. That is both shorter and more elegant than either workaround.

Now on to a second issue.
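The affinity effect described above is easy to reproduce outside the CLI. A small sketch using Python’s sqlite3 module (table names invented; the inserts stand in for what .import does):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# All-TEXT table, as .import creates when the table does not pre-exist.
con.execute("CREATE TABLE tab_text (id TEXT, val1 TEXT, val2 TEXT)")
con.executemany("INSERT INTO tab_text VALUES (?, ?, ?)",
                [("A", "27", "8"), ("B", "3", "12")])
# TEXT affinity sorts lexicographically: '12' < '8', so B comes first.
print(con.execute("SELECT id FROM tab_text ORDER BY val2").fetchall())

# Pre-declared INTEGER affinity coerces '8' and '12' to numbers on insert.
con.execute("CREATE TABLE tab_int (id TEXT, val1 INT, val2 INT)")
con.executemany("INSERT INTO tab_int VALUES (?, ?, ?)",
                [("A", "27", "8"), ("B", "3", "12")])
# Numeric sort: 8 < 12, so A comes first, as the CSV suggests.
print(con.execute("SELECT id FROM tab_int ORDER BY val2").fetchall())
```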
Let’s assume you have successfully imported a file containing these four lines:

  id,val1,val2
  A,27,8
  B,3,12
  C,,1

into the table tab mentioned above, resulting in three rows. Notice the missing value in column val1 on the last line. This missing value is imported as an empty string '' regardless of the affinity of column val1.

That leads to (mathematically) unexpected results from aggregate functions:

  sqlite> SELECT SUM(val1) FROM tab;
  30 -- Well, treating '' like 0 is OK in this case
  sqlite> SELECT COUNT(val1) FROM tab;
  3  -- but not so here; only two values/numbers were given in data.csv
  sqlite> SELECT AVG(val1) FROM tab;
  10 -- the average of 3 and 27 is 15
  sqlite> SELECT MAX(val1) FROM tab;
  '' -- not to be expected when looking at data.csv

OK, I hear you say, what about this workaround:

  sqlite> UPDATE tab SET val1 = NULL WHERE val1 = '';

This makes the surprises above go away, but it is again tedious to do for all columns containing only numbers.

My suggestion is: if someone goes the extra mile and defines a table in advance in order to have an appropriate numerical affinity (INTEGER, REAL or NUMERIC) for a column, then it is OK to encode a missing value as NULL instead of ''. It seems right, though, to keep the current behaviour for columns with affinity TEXT (the default) or BLOB.

To sum things up:

1. There should be no penalty for using header lines in CSV files. Thus a new flag -h for .import is much appreciated.
2. Missing values in columns with numerical affinity should show up as NULL values.

Thanks for reading, I look forward to your opinions about these issues.

Greetings
Christof
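The aggregate surprises and the effect of the NULL workaround can be checked directly. A sketch using Python’s sqlite3 module (mirroring the tab example above; the '' row stands in for the blank CSV field):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id TEXT, val1 INT)")
con.executemany("INSERT INTO tab VALUES (?, ?)",
                [("A", "27"), ("B", "3"), ("C", "")])

# '' survives even in an INT column, so COUNT sees 3 values and
# SUM treats '' as 0, giving AVG = 30/3 = 10 instead of 15.
print(con.execute("SELECT COUNT(val1), AVG(val1) FROM tab").fetchone())

# After the workaround, the aggregates see only the two real numbers.
con.execute("UPDATE tab SET val1 = NULL WHERE val1 = ''")
print(con.execute("SELECT COUNT(val1), AVG(val1) FROM tab").fetchone())
```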