Hi all. I spent the last days bragging that a single database file as provided by sqlite is a far better approach to store data than -literally- thousands of flat files. Now, I got a small amount of testing data an wow ... I'm stuck.
Area: Bioinformatics. Imagine a matrix of data: genetic marker names (attribute A) in columns and individuals (attribute B) in rows. Since the number of features per attribute varies between projects, I decided to create three tables: * Table markers: the genetic markers (attribute A), e.g. 100.000 rows * Table individuals: individual ids (attribute B), e.g. 1.000 rows * Table genotypes: the genetic data Tables "markers" and "individuals" have 2 and 6 columns respectively, a unique primary key, and the (basically) the name of the feature, "genotypes" holds foreign keys to "markers"/"individuals" respectively as well as the genotype column(s), see below. Genotypes are inserted by: INSERT OR ROLLBACK INTO genotypes VALUES ((SELECT id FROM marker WHERE name='$markername$'), (SELECT id FROM individuals WHERE pedigree='$pedigree$' AND person='$person$'), $genA$, $genB$); Where $markername$, ..., $genB$ are replaced with the appropiate values. Given the schema below, feeding a million INSERTs into the database by sqlite3_exec() takes about 30 minutes (this includes transactions, indices and "PRAGMA synchronous=off" as well as a fully optimized build of sqlite). Is there any chance to speed this up? Production datasets could easily bring a billion genotypes ... Any pointer would be appreciated! With kind regards Daniel Franke -- The database schema: CREATE TABLE marker ( id integer PRIMARY KEY AUTOINCREMENT, name varchar UNIQUE); CREATE INDEX markernameidx on marker(name); CREATE TABLE individuals ( id integer PRIMARY KEY AUTOINCREMENT, pedigree varchar NOT NULL, person varchar NOT NULL, father varchar, mother varchar, sex integer NOT NULL, UNIQUE(pedigree, person)); CREATE INDEX individualidx ON individuals (pedigree, person); CREATE TABLE genotypes( markerid integer NOT NULL REFERENCES marker(id), individualid integer NOT NULL REFERENCES individuals(id), genA integer, genB integer, UNIQUE(markerid, individualid)); CREATE INDEX genotypeidx ON genotypes(markerid, individualid);