> > At first I was just doing something like this pseudocode:
> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
> 
> Might I suggest that instead of trying to store an ever-changing value, you 
> simply figure it out when it's needed? I don't quite understand the logic 
> you're applying to calculate your 'expires' value but I think it could be 
> calculated with a SELECT whenever you needed it rather than being stored.


The logic is, "keep a given (k1, k2) pair around for one day for each time it's 
been seen". I could calculate it when it's needed, but since it's a computed 
value, I couldn't then have an index on it. That said, I've removed this in 
several tests and it doesn't appear to have any bearing on the performance issue.
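
For concreteness, the per-row version of that logic as a single SQLite statement
would be roughly the following (a sketch only; the column names count/expires
and storing expires as Unix seconds are assumptions, and :k1/:k2 are bound
parameters):

INSERT OR REPLACE INTO counters (k1, k2, count, expires)
VALUES (:k1, :k2,
        -- new count = old count (0 if the pair is new) + 1
        COALESCE((SELECT count FROM counters
                  WHERE k1 = :k1 AND k2 = :k2), 0) + 1,
        -- expires = now + (new count) * 1 day
        strftime('%s','now') + 86400 *
            (COALESCE((SELECT count FROM counters
                       WHERE k1 = :k1 AND k2 = :k2), 0) + 1));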

> > Now the code that inserts all of the rows into the memory table executes 
> > nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the 
> > journal (in either rollback or wal mode) balloons to over 300mb in size.
> You don't list any indexes which would help with your WHERE clauses, so I 
> suspect SQLite is having to look through all the records in 'counters' in 
> order to find the rows it needs for each COALESCE. The large size of the 
> journal is because you are replacing every row in the database.


Maybe I wasn't clear: the ---cut here--- bit is in addition to the existing 
schema (after all, the INSERT OR REPLACE updates it, so surely it must already 
exist).

In the INSERT OR REPLACE operation there is no WHERE clause. (k1, k2) is the 
primary key on the 'counters' table, so the INSERT OR REPLACE takes each value 
out of the temporary trans_counters_v view of the in-memory trans_counters 
temporary table and either replaces or inserts a value for each corresponding 
entry in 'counters'. AFAICT, no amount of indexing here would help. There's no 
reason to index the temporary table, since the sort for the GROUP BY is 
n*log(n), and updating the index for each individual entry would still sum to 
n*log(n). The temporary table is single-use and will have to be scanned every 
time regardless. (And if I isolate that GROUP BY operation, it's blindingly 
fast.) Here's the EXPLAIN QUERY PLAN for the INSERT OR REPLACE:

(0, 0, 0, u'SCAN TABLE trans_counters (132971 rows)')
(0, 0, 0, u'USE TEMP B-TREE FOR GROUP BY')
(0, 0, 0, u'EXECUTE CORRELATED SCALAR SUBQUERY 1')
(1, 0, 0, u'SEARCH TABLE counters USING INDEX sqlite_autoindex_counters_1 (k1=? AND k2=?) (~1 rows)')
(0, 0, 0, u'EXECUTE CORRELATED SCALAR SUBQUERY 2')
(2, 0, 0, u'SEARCH TABLE counters USING INDEX sqlite_autoindex_counters_1 (k1=? AND k2=?) (~1 rows)')
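
That plan is consistent with a bulk statement shaped roughly like this (again a
sketch rather than the literal statement; the view's aggregate column name n and
the expires arithmetic are assumptions):

INSERT OR REPLACE INTO counters (k1, k2, count, expires)
SELECT v.k1, v.k2,
       -- add this batch's count to whatever is already on disk
       COALESCE((SELECT c.count FROM counters c
                 WHERE c.k1 = v.k1 AND c.k2 = v.k2), 0) + v.n,
       strftime('%s','now') + 86400 *
           (COALESCE((SELECT c.count FROM counters c
                      WHERE c.k1 = v.k1 AND c.k2 = v.k2), 0) + v.n)
FROM trans_counters_v AS v;

i.e. one scan of the grouped temporary data plus two indexed lookups into
'counters' per output row, which is what the plan shows.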

So no, it's not having to replace every entry in the 'counters' table. Also, if 
it were replacing every row in the database, then the journal would grow to 
equal the (VACUUM'd) size of the database, but it doesn't get that big. It gets 
to 300mb+, but not to the full size of 1.9gb.
 
> > So I feel like something about what I'm doing is fundamentally flawed given 
> > something about sqlite's performance model. All I want is a count of the 
> > number of times that I've seen each pair (k1, k2), is there a better way to 
> > do this without storing them all individually and grouping them later?
> 
> If you have a table with two columns k1, k2 in, and you want to count the 
> times each pair occurs, you can do it in software far faster by having this 
> index and using this SELECT
> CREATE INDEX myTable_keypair ON myTable (k1,k2)
> SELECT k1,k2 from myTable ORDER BY k1,k2
> you might even use one of the following if you know it will always return a 
> unique value for unique keys
> SELECT k1||k2 from myTable ORDER BY k1,k2
> SELECT k1||':'||k2 from myTable ORDER BY k1,k2
> Just count the unique values in your programming language as they go past. 
> Yes, you can use horrendous complication to make SQLite present a neatly 
> formatted return with the counts included, but defining that in SQL makes 
> SQLite do more work than your programming language would need to do.


The problem isn't grouping them to count them. If I evaluate the TEMPORARY VIEW 
that does the GROUP BY above, it completes in just under a second for all 100k 
items in the table it references. That part is very fast. The slow bit is 
incorporating those counts into the versions on disk.
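
For reference, the view doing the grouping is nothing exotic; it's along these
lines (a sketch, with the aggregate column name assumed):

CREATE TEMPORARY VIEW trans_counters_v AS
    SELECT k1, k2, COUNT(*) AS n
    FROM trans_counters
    GROUP BY k1, k2;

Evaluating that on its own is quick; it's only when its output is folded back
into 'counters' with the INSERT OR REPLACE that things crawl.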


