On Sun, Sep 14, 2008 at 11:01 AM, Holger Lembke <[EMAIL PROTECTED]> wrote:
> (Hello all, new user... refreshing my rusty sql knowledge)
>
> I have a simple table created with
>
>  create virtual table ft using fts3(id, content);
>
> and store id/text-pairs with
>
>  delete from ft where (id="theid");

Note that this will involve a full table scan to find the row with
id="theid".  fts3 tables have an index on tokens, and an implicit
unique index on docid/rowid, and this can use neither.

If having a text id is important, you could do something like:

CREATE TABLE ft_id(docid INTEGER PRIMARY KEY, id TEXT UNIQUE);
CREATE VIRTUAL TABLE ft USING fts3(content);

Then inserts would be something like:

INSERT INTO ft_id (docid, id) VALUES (null, ?); -- bind "theid" or whatever
INSERT INTO ft (docid, content) VALUES (LAST_INSERT_ROWID(), ?); --
bind "sometext" or whatever

and delete would be something like:

DELETE FROM ft WHERE docid = (SELECT docid FROM ft_id WHERE id = ?);
-- bind "theid" or whatever
DELETE FROM ft_id WHERE id = ?;

>  insert into ft values ("theid", "sometext");
>
> To test the table I have 400 files containing old mails. Just an random
> choice of contents. Ids are internally counted up simply, so it is 1 to 400.
> I remove ":" and cr/lf from mails but leave everything else. But the effect
> seems to be content independant.
>
> If I repeat this process multiple times I get a growing database. select
> count(*) from ft and ft_content stays at 400, ft_segdir stays at 30.
>
> But select count(*) from ft_segments is raising, from 756, 1237 to 2100 and
> so on. Ok, that explains the growing of my database file.

You're seeing two effects.

For performance reasons, fts3 stores new updates in new segments (in
ft_segments), and over time merges them together.  So index
information doesn't always immediately change to reflect "truth".
Over time as additional updates happen, older segments are merged and
older data is dropped (as newer data supercedes it), so you should
eventually see things stabilize.

Additionally, fts3 implements delete as a sort of negative posting
list.  Unfortunately, when implementing things I was not able to think
of a clear way to handle negative posting lists supercede positive
posting lists while still maintaining correctness across segments.  I
have an implementation which implements this, but haven't polished it
up enough to check it in.

Over the summer I added an fts3 function called optimize() which can
be used to deal with some of this.  If you do something like this:

   SELECT optimize(ft) FROM ft LIMIT 1;

fts3 will combine all index data into a single segment, and drop
unnecessary negative posting lists.  Also, if you happen to delete all
data from your fts3 table, the index will be cleared out.

Long-term, the ability to have negative posting lists annihilate
positive posting lists is obviously the better solution.  I had hoped
to get it in last spring, but unfortunately I keep getting distracted
by other stuff.  The main sticking point was that I also wanted to
build an fts_migrate facility to make it easy to migrate data between
differently-defined fts tables (either between fts3 and fts4, or
between fts tables with different schema).  No estimate as to
completion point for that.

-scott
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to