I would appreciate any help with improving the query times I'm
experiencing. The database is composed of n-grams drawn from a set of
texts, which are dynamically labelled (some with no label, which are not
counted) and differences and intersections are found between the n-grams
of each labelled set of texts.

The schema is:

CREATE TABLE Text (
            id INTEGER PRIMARY KEY ASC,
            filename TEXT UNIQUE NOT NULL,
            checksum TEXT NOT NULL,
            label TEXT NOT NULL
            );
CREATE TABLE TextNGram (
            text INTEGER NOT NULL REFERENCES Text (id),
            ngram TEXT NOT NULL,
            size INTEGER NOT NULL,
            count INTEGER NOT NULL
            );
CREATE INDEX TextIndex ON Text (id, label);
CREATE INDEX TextNGramIndexNGram ON TextNGram (ngram);
CREATE INDEX TextNGramIndexSize ON TextNGram (size, ngram);

There are over 4000 rows in Text and over a billion in TextNGram. The
slightly unnormalised schema is due to wanting to keep the time spent
inserting all of the data low.

The two queries (one to get a symmetric difference between sets of
Texts' n-grams, the other to get an intersection) are:

SELECT TextNGram.ngram, TextNGram.count freq_count, Text.filename,
Text.label
   FROM TextNGram, Text
   WHERE Text.label IN (%s)
      AND TextNGram.size BETWEEN ? AND ?
      AND TextNGram.text = Text.id
      AND NOT EXISTS
         (SELECT tn.ngram
            FROM TextNGram tn, Text t
            WHERE t.label != Text.label
               AND t.label != ''
               AND tn.text = t.id
               AND tn.ngram = TextNGram.ngram
   ORDER BY TextNGram.size, TextNGram.ngram

and

SELECT TextNgram.ngram, TextNGram.count freq_count, Text.filename,
Text.label
   FROM TextNGram, Text
   WHERE Text.label IN (%s)
      AND Text.id = TextNGram.text
      AND TextNGram.size BETWEEN ? AND ?
      [ The EXISTS sub-select is repeated for each label ]
      AND EXISTS
         (SELECT t.label
            FROM Text t, TextNGram tn
            WHERE t.label = ?
               AND tn.text = t.id
               AND tn.ngram = TextNGram.ngram) 
   ORDER BY TextNGram.size, TextNGram.ngram

I don't know whether my indices can be improved, or the queries
themselves, but currently queries can take hours to run.

Thank you for any help!

Jamie

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to