> > CREATE TABLE genotypes(
> >     markerid integer NOT NULL REFERENCES marker(id),
> >     individualid integer NOT NULL REFERENCES individuals(id),
> >     genA integer,
> >     genB integer,
> >     UNIQUE(markerid, individualid));
> >
> > CREATE INDEX genotypeidx ON genotypes(markerid, individualid);
> >
>
> [...] So if you are inserting into both a large table
> and a large index, the index insertion time dominates.
>
> In the schema above, you have created two identical indices.

Thanks for pointing this out. I actually found this a few hours ago by
checking the sqlite_master table. Since I prefer explicit statements
over implicit indices, I removed the UNIQUE constraint from the table
definition and added it to the index as suggested.
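For illustration, the reworked schema might look like the following (a minimal sketch using Python's sqlite3 module; the table and index names follow the schema quoted above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE marker(id INTEGER PRIMARY KEY);
CREATE TABLE individuals(id INTEGER PRIMARY KEY);
-- UNIQUE(markerid, individualid) removed from the table definition ...
CREATE TABLE genotypes(
    markerid integer NOT NULL REFERENCES marker(id),
    individualid integer NOT NULL REFERENCES individuals(id),
    genA integer,
    genB integer);
-- ... and expressed as one explicit index instead of two identical ones
CREATE UNIQUE INDEX genotypeidx ON genotypes(markerid, individualid);
""")

# sqlite_master now lists exactly one index on genotypes
names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)
```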


> We have also found that inserts go faster still if you do the
> table inserts first, then after the table is fully populated
> create the index.  If you are actively using the uniqueness
> constraint to reject redundant entries, this approach will not
> work for you - you will need to specify the uniqueness constraint
> and thus create the index before any data has been inserted.
We decided to employ the uniqueness constraint as an additional quality
measure. It sometimes happens that individuals are assigned multiple
genotypes at the same locus, or something similar. Adding the index
afterwards would detect such errors, but it would be quite difficult to
trace them back to the original data files (and to report them).

Instead, we opted for another approach:
The initial marker files (>= 100,000) are translated into SQL INSERT
statements. If a file is valid (not malformed), the statements are
written to stdout and piped into another application that reads them and
passes them to sqlite (for documentation and replay purposes, one could
also "tee" them to a log). Since all INSERTs are INSERT OR ROLLBACK, the
database will never be tainted with partially committed input
files. If there are errors, the files are checked and INSERTed again
en bloc.
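The effect of INSERT OR ROLLBACK can be sketched like this (Python's sqlite3 module with manual transaction control; the duplicate row is a made-up example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # pass BEGIN/COMMIT through verbatim
conn.execute("CREATE TABLE genotypes(markerid, individualid, genA, genB)")
conn.execute("CREATE UNIQUE INDEX genotypeidx "
             "ON genotypes(markerid, individualid)")

conn.execute("BEGIN")
try:
    conn.execute("INSERT OR ROLLBACK INTO genotypes VALUES (1, 7, 0, 1)")
    # duplicate (markerid, individualid) -> constraint violation
    conn.execute("INSERT OR ROLLBACK INTO genotypes VALUES (1, 7, 1, 1)")
    conn.execute("COMMIT")
except sqlite3.IntegrityError:
    pass  # SQLite has already rolled the whole transaction back

# The first, valid INSERT was undone along with the bad one:
count = conn.execute("SELECT count(*) FROM genotypes").fetchone()[0]
print(count)
```

So a malformed file leaves no trace at all, which is exactly the "never tainted" property described above.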


> Anything you can do to improve locality of reference while
> inserting data into the database will help.  If you are
> inserting all markers for the same individual at once, then
> you will do better to create your index as
>
>     CREATE UNIQUE INDEX idx ON genotypes(individualid, markerid)
>
> rather than the other way around.

That's something to keep in mind. I will give it a try, thanks =)
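A sketch of what that would look like (Python's sqlite3 module; the grouped-by-individual insert order is an assumption about how the data files arrive):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genotypes(markerid, individualid, genA, genB)")
# Index column order matches the insert order: all markers for one
# individual arrive together, so index insertions stay local.
conn.execute("CREATE UNIQUE INDEX idx ON genotypes(individualid, markerid)")

with conn:
    for individualid in range(3):      # outer loop: individuals
        for markerid in range(5):      # inner loop: that individual's markers
            conn.execute("INSERT INTO genotypes VALUES (?, ?, 0, 1)",
                         (markerid, individualid))

count = conn.execute("SELECT count(*) FROM genotypes").fetchone()[0]
print(count)
```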


> Finally, parsing a few million INSERT statements is very fast
> in SQLite but it still takes time.  You can increase the speed
> by a factor of 3 or more by using prepared statements, if you
> are not already.
As I understand it, sqlite3_prepare() and friends speed up
statements that can be reused. After the initial parsing of the data
file(s), all I have is a long string of INSERT statements, and each
statement is used exactly once.
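That said, the statements need not be textually identical to benefit: if the parser emitted value tuples instead of finished SQL text, one prepared INSERT could be bound and re-executed for every row. A minimal sketch with Python's sqlite3 module, whose executemany() prepares the statement once (the row counts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genotypes(markerid, individualid, genA, genB)")

# Instead of millions of literal INSERT strings, keep one template and
# bind the parsed values; the SQL is parsed and prepared only once.
rows = [(m, i, 0, 1) for i in range(100) for m in range(50)]
with conn:
    conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?, ?)", rows)

count = conn.execute("SELECT count(*) FROM genotypes").fetchone()[0]
print(count)
```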


Another question that arose today:
Is there any penalty for switching tables during INSERTs within the
same COMMIT? E.g.

    BEGIN;
    INSERT INTO tableA VALUES ...;
    INSERT INTO tableB VALUES ...;
    INSERT INTO tableA VALUES ...;
    INSERT INTO tableB VALUES ...;
     :
    COMMIT;

opposed to

    BEGIN;
    INSERT INTO tableA VALUES ...;
    INSERT INTO tableA VALUES ...;
    INSERT INTO tableA VALUES ...;
      :
    COMMIT;
    BEGIN;
    INSERT INTO tableB VALUES ...;
    INSERT INTO tableB VALUES ...;
    INSERT INTO tableB VALUES ...;
      :
    COMMIT;

Yesterday I did the former, and it seemed to take ages. Today I used the
latter ... it seems to be faster?!


Many thanks for your replies, everyone =)

    Daniel
