I would appreciate any help with improving the query times I'm
experiencing. The database is composed of n-grams drawn from a set of
texts, which are dynamically labelled (some with no label, which are not
counted) and differences and intersections are found between the n-grams
of each labelled set of texts.
The schema is:
CREATE TABLE Text (
id INTEGER PRIMARY KEY ASC,
filename TEXT UNIQUE NOT NULL,
checksum TEXT NOT NULL,
label TEXT NOT NULL
);
CREATE TABLE TextNGram (
text INTEGER NOT NULL REFERENCES Text (id),
ngram TEXT NOT NULL,
size INTEGER NOT NULL,
count INTEGER NOT NULL
);
CREATE INDEX TextIndex ON Text (id, label);
CREATE INDEX TextNGramIndexNGram ON TextNGram (ngram);
CREATE INDEX TextNGramIndexSize ON TextNGram (size, ngram);
There are over 4000 rows in Text and over a billion in TextNGram. The
slightly unnormalised schema is due to wanting to keep the time spent
inserting all of the data low.
The two queries (one to get a symmetric difference between sets of
Texts' n-grams, the other to get an intersection) are:
SELECT TextNGram.ngram, TextNGram.count freq_count, Text.filename,
Text.label
FROM TextNGram, Text
WHERE Text.label IN (%s)
AND TextNGram.size BETWEEN ? AND ?
AND TextNGram.text = Text.id
AND NOT EXISTS
(SELECT tn.ngram
FROM TextNGram tn, Text t
WHERE t.label != Text.label
AND t.label != ''
AND tn.text = t.id
AND tn.ngram = TextNGram.ngram
ORDER BY TextNGram.size, TextNGram.ngram
and
SELECT TextNgram.ngram, TextNGram.count freq_count, Text.filename,
Text.label
FROM TextNGram, Text
WHERE Text.label IN (%s)
AND Text.id = TextNGram.text
AND TextNGram.size BETWEEN ? AND ?
[ The EXISTS sub-select is repeated for each label ]
AND EXISTS
(SELECT t.label
FROM Text t, TextNGram tn
WHERE t.label = ?
AND tn.text = t.id
AND tn.ngram = TextNGram.ngram)
ORDER BY TextNGram.size, TextNGram.ngram
I don't know whether my indices can be improved, or the queries
themselves, but currently queries can take hours to run.
Thank you for any help!
Jamie
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users