I have a shell pipeline that I would like to move into sqlite3. However, the sqlite3 version runs significantly slower, which makes me think I am doing something wrong. How can I improve the performance of this task using sqlite?
foo.tsv is a tab-delimited file with 22 fields and over 5.6 million records:

    $ time -p wc -l foo.tsv
    5603674 foo.tsv
    real 0.49
    user 0.34
    sys 0.15

I need an aggregate count based on the values in fields 9 and 22. Here is my current pipeline:

    $ time -p cat foo.tsv | cut -f9,22 | sort | uniq -c | perl -lane 'print join("\t", @F[1,2,0])' | wc -l
    1992301
    real 15.69
    user 15.63
    sys 1.07

Here is my first attempt at an equivalent pipeline in sqlite3:

    $ time -p { grep -v '^#' <<eof
    create table foo (
        col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8,
        col_9, col_10, col_11, col_12, col_13, col_14, col_15, col_16,
        col_17, col_18, col_19, col_20, col_21, col_22
    ) ;
    .mode tabs
    .import "foo.tsv" "foo"
    select col_9, col_22, count(*) as "count"
    from foo
    group by col_9, col_22 ;
    eof
    } | sqlite3 foo.sqlite | wc -l

That had been running for more than 30 minutes before I killed it. Here is my second attempt, pointing the temporary folder at shared memory:

    $ time -p { grep -v '^#' <<eof
    drop table if exists foo ;
    create table foo (
        col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8,
        col_9, col_10, col_11, col_12, col_13, col_14, col_15, col_16,
        col_17, col_18, col_19, col_20, col_21, col_22
    ) ;
    .mode tabs
    .import "foo.tsv" "foo"
    select col_9, col_22, count(*) as "count"
    from foo
    group by col_9, col_22 ;
    eof
    } | TMPDIR=/dev/shm/ sqlite3 foo.sqlite | wc -l
    1992301
    real 105.95
    user 80.02
    sys 20.59

That is roughly seven times slower than the existing pipeline. Any ideas on how to speed up the sqlite3 step? Thanks in advance for any pointers.

Regards,

- Robert
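P.S. One variant I plan to test next, based on my reading of the PRAGMA documentation (I have not benchmarked it yet): foo.sqlite is a throwaway scratch database, so it should be safe to switch off the rollback journal and synchronous writes for the duration of the import, and to keep SQLite's temporary storage in memory rather than relying on TMPDIR:

    $ time -p { grep -v '^#' <<eof
    # durability does not matter for a scratch database, so skip
    # the rollback journal and the fsync calls during the import
    pragma journal_mode = off ;
    pragma synchronous = off ;
    # keep the GROUP BY's temporary structures in memory
    pragma temp_store = memory ;
    drop table if exists foo ;
    create table foo (
        col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8,
        col_9, col_10, col_11, col_12, col_13, col_14, col_15, col_16,
        col_17, col_18, col_19, col_20, col_21, col_22
    ) ;
    .mode tabs
    .import "foo.tsv" "foo"
    select col_9, col_22, count(*) as "count"
    from foo
    group by col_9, col_22 ;
    eof
    } | sqlite3 foo.sqlite | wc -l

(Note that "pragma journal_mode = off" echoes its new value as a result row, so the wc -l count should come out one higher than 1992301.) If that is still not competitive, the next experiment would be to point the same script at a transient in-memory database, i.e. ending the command with } | sqlite3 :memory: | wc -l, which takes disk I/O out of the picture entirely at the cost of holding the whole 5.6 million-row table in RAM.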