Re: [sqlite] Help with SQL and index (or schema?)

Kees Nuyt Fri, 27 Mar 2009 10:08:19 -0700

On Fri, 27 Mar 2009 15:53:18 +0100, Jonas Sandman
<[email protected]> wrote:


>Hello,
>
>I have a database with about 137000 * 2 rows with four columns;
>fileid, filename, filepath and istarget.
>It's used to determine if two scanned directories are equal so I run a
>simply query to get the rows that are missing on the target directory
>but do exists in the source directory.
>
>I use this query:
>
>SELECT f.filepath, f.filename FROM files f
>WHERE f.istarget=0
>AND NOT EXISTS (SELECT * FROM files WHERE filepath=f.filepath AND
>filename=f.filename AND istarget=1)

In the sub-SELECT you use * so you get all columns. That's
not necessary. 

SELECT f.filepath, f.filename 
  FROM files f
 WHERE f.istarget=0
 AND NOT EXISTS (
   SELECT ROWID FROM files 
    WHERE filepath=f.filepath
      AND filename=f.filename
      AND istarget=1
);

It might be better to rewrite as a self-join using LEFT
OUTER JOIN on filename and filepath, using an NULL istarget
from one or the other alias as an indication that target
instance of the file is missing.

Something like:

SELECT source.filepath, source.filename
  FROM files AS source
  LEFT OUTER JOIN files AS target
  USING (filepath,filename)
  WHERE source.istarget = 0
    AND target.istarget IS NULL
  ORDER BY whatever you like;

(untested)

>and I have experimented with some index to improve the speed, both
>index (filename, filepath, istarget), (filename), (filepath) etc...

The SQLite optimizer can only use one index at a time.
In general, the index should be as selective as possible.
This is known as cardinality.

In your case, (filename, filepath) should do,
unless all files in all directories have the same series of
names, in which (filepath, filename) could be better.
Adding istarget doesn't hurt, it would make the index usable
as a primary key.

If you defined more than one index, you can help the
optimizer to choose the best index by running ANALYZE; on a
database filled with representative data.

>I am still not quite satisfied with the speed (a few seconds to check this).
>Perhaps the table schema itself is the problem? 

You could normalise some more, by creating a second table

CREATE TABLE Pathnames (
        pathid    INTEGER PRIMARY KEY,
        pathname  UNIQUE
);
and referring to its with a foreign key in the file table.

CREATE TABLE Files (
        pathid INTEGER
                CONSTRAINT fk_path REFERENCES Pathnames (pathid) 
                ON INSERT RESTRICT
                ON UPDATE RESTRICT,
        filename TEXT,
        istarget INTEGER,
   PRIMARY KEY (filepathid,filename,istarget)
);

This reduces the overall database size, improves the amount
of unique data in a database page, and above that comparing
integers is faster than strings.

I can't help noticing that comparing two directories in the
same system would cause the source and target paths to be
different by definition (assuming the filepath column
represents the absolute path), but I guess you are aware of
that :)

>Can someone help me?
>
>Regards,
>Jonas
-- 
  (  Kees Nuyt
  )
c[_]
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Help with SQL and index (or schema?)

Reply via email to