I have a command pipeline that I would like to port to sqlite.
However, the sqlite version runs significantly slower, which makes me
think that I am doing something not quite correctly.  How can I
improve the performance of this task using sqlite?

foo.tsv is a tab-delimited file with 22 fields and just over 5.6
million records.

$ time -p wc -l foo.tsv
5603674 foo.tsv
real 0.49
user 0.34
sys 0.15

I need a count of rows for each distinct pair of values in fields 9
and 22.
Here is my current pipeline:

$ time -p cat foo.tsv | cut -f9,22 | sort | uniq -c \
    | perl -lane 'print join("\t", @F[1,2,0])' | wc -l
1992301
real 15.69
user 15.63
sys 1.07
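
For reference, the same counts can be produced in a single pass with
awk; I have not benchmarked this variant, but it states the
aggregation compactly:

$ awk -F'\t' '{ n[$9 "\t" $22]++ }
      END { for (k in n) print k "\t" n[k] }' foo.tsv | wc -l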

Here is my first attempt at an equivalent pipeline in sqlite3:

$ time -p {  grep -v '^#' <<eof
create table foo (
 col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
 col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17,
 col_18, col_19, col_20, col_21, col_22 ) ;
.mode tab
.imp "foo.tsv" "foo"
select col_9, col_22, count(*) as "count" from foo group by col_9, col_22 ;
eof
} |  sqlite3 foo.sqlite | wc -l

That had been running for more than 30 minutes when I killed it.
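
One thing I have seen recommended for bulk loads, but have not tried
here, is disabling the journal and synchronous writes before the
import.  A sketch of what I have in mind (untested; note that the
journal_mode pragma echoes one line of output, so the final count
would be off by one):

$ grep -v '^#' <<eof | sqlite3 foo.sqlite | wc -l
# pragmas below are meant to reduce I/O during the bulk load
pragma journal_mode = off ;
pragma synchronous = off ;
pragma temp_store = memory ;
drop table if exists foo ;
create table foo (
 col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
 col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17,
 col_18, col_19, col_20, col_21, col_22 ) ;
.mode tabs
.import "foo.tsv" "foo"
select col_9, col_22, count(*) as "count" from foo group by col_9, col_22 ;
eof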

Here's my second attempt, pointing the temporary directory (TMPDIR)
at shared memory:

$ time -p {
grep -v '^#' <<eof
drop table if exists foo ;
create table foo (
 col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
 col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17,
 col_18, col_19, col_20, col_21, col_22 ) ;
.mode tab
.imp "foo.tsv" "foo"
select col_9, col_22, count(*) as "count" from foo group by col_9, col_22 ;
eof
} |  TMPDIR=/dev/shm/ sqlite3 foo.sqlite | wc -l
1992301
real 105.95
user 80.02
sys 20.59

That works, but takes about seven times as long as the existing
pipeline (105.95 s vs. 15.69 s).  Any ideas on how to speed up the
sqlite approach?
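
For completeness, one more variant I have not yet tried: loading into
an in-memory database instead of foo.sqlite, to take the filesystem
out of the picture entirely:

$ grep -v '^#' <<eof | sqlite3 :memory: | wc -l
create table foo (
 col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
 col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17,
 col_18, col_19, col_20, col_21, col_22 ) ;
.mode tabs
.import "foo.tsv" "foo"
select col_9, col_22, count(*) as "count" from foo group by col_9, col_22 ;
eof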

Thanks in advance for any pointers.

Regards,
- Robert