[sqlite] large table performance

Daniel Franke Thu, 02 Mar 2006 16:15:14 -0800

Hi all.

I spent the last days bragging that a single database file as provided by 
sqlite is a far better approach to store data than -literally- thousands of 
flat files. Now, I got a small amount of testing data an wow ... I'm stuck.


Area: Bioinformatics. Imagine a matrix of data: genetic marker names 
(attribute A) in columns and individuals (attribute B) in rows. Since the 
number of features per attribute varies between projects, I decided to create 
three tables:

 * Table markers:     the genetic markers (attribute A), e.g. 100.000 rows
 * Table individuals: individual ids (attribute B), e.g. 1.000 rows
 * Table genotypes:   the genetic data

Tables "markers" and "individuals" have 2 and 6 columns respectively, a unique 
primary key, and the (basically) the name of the feature, "genotypes" holds 
foreign keys to "markers"/"individuals" respectively as well as the genotype 
column(s), see below.

Genotypes are inserted by:

INSERT OR ROLLBACK INTO genotypes VALUES ((SELECT id FROM marker WHERE 
name='$markername$'), (SELECT id FROM individuals WHERE pedigree='$pedigree$' 
AND person='$person$'), $genA$, $genB$);

Where $markername$, ..., $genB$ are replaced with the appropiate values.

Given the schema below, feeding a million INSERTs into the database by 
sqlite3_exec() takes about 30 minutes (this includes transactions, indices 
and "PRAGMA synchronous=off" as well as a fully optimized build of sqlite). 

Is there any chance to speed this up? Production datasets could easily bring a 
billion genotypes ...


Any pointer would be appreciated!

With kind regards

        Daniel Franke


--

The database schema:


CREATE TABLE marker (
    id integer PRIMARY KEY AUTOINCREMENT,
    name varchar UNIQUE);

CREATE INDEX markernameidx on marker(name);


CREATE TABLE individuals (
    id integer PRIMARY KEY AUTOINCREMENT,
    pedigree varchar NOT NULL,
    person varchar NOT NULL,
    father varchar,
    mother varchar,
    sex integer NOT NULL,
    UNIQUE(pedigree, person));

CREATE INDEX individualidx ON individuals (pedigree, person);


CREATE TABLE genotypes(
    markerid integer NOT NULL REFERENCES marker(id),
    individualid integer NOT NULL REFERENCES individuals(id),
    genA integer,
    genB integer,
    UNIQUE(markerid, individualid));

CREATE INDEX genotypeidx ON genotypes(markerid, individualid);

[sqlite] large table performance

Reply via email to